diff --git "a/data/cvpr2024_papers_with_details.csv" "b/data/cvpr2024_papers_with_details.csv" new file mode 100644--- /dev/null +++ "b/data/cvpr2024_papers_with_details.csv" @@ -0,0 +1,44044 @@ +Title,Authors,Link,arXiv_link,other_link,pdf_path,arXiv_title,summary,primary_category,categories +No title found,No authors listed, ,,https://pine.libguides.com/c.php?g=997445&p=7219661,,,,,nan +CapsFusion: Rethinking Image-Text Data at Scale,Qiying Yu · Quan Sun · Xiaosong Zhang · Yufeng Cui · Yufeng Cui · Fan Zhang · Yue Cao · Xinlong Wang · Jingjing Liu, ,https://arxiv.org/abs/2310.20550,,,CapsFusion: Rethinking Image-Text Data at Scale,"Large multimodal models demonstrate remarkable generalist ability to perform +diverse multimodal tasks in a zero-shot manner. Large-scale web-based +image-text pairs contribute fundamentally to this success, but suffer from +excessive noise. Recent studies use alternative captions synthesized by +captioning models and have achieved notable benchmark performance. However, our +experiments reveal significant Scalability Deficiency and World Knowledge Loss +issues in models trained with synthetic captions, which have been largely +obscured by their initial benchmark success. Upon closer examination, we +identify the root cause as the overly-simplified language structure and lack of +knowledge details in existing synthetic captions. To provide higher-quality and +more scalable multimodal pretraining data, we propose CapsFusion, an advanced +framework that leverages large language models to consolidate and refine +information from both web-based image-text pairs and synthetic captions. +Extensive experiments show that CapsFusion captions exhibit remarkable +all-round superiority over existing captions in terms of model performance +(e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample +efficiency (requiring 11-16 times less computation than baselines), world +knowledge depth, and scalability. These effectiveness, efficiency and +scalability advantages position CapsFusion as a promising candidate for future +scaling of LMM training.",cs.CV,nan +Semantic-Aware Multi-Label Adversarial Attacks,Hassan Mahmood · Ehsan Elhamifar, ,https://arxiv.org/abs/2401.16001,,2401.16001.pdf,LESSON: Multi-Label Adversarial False Data Injection Attack for Deep Learning Locational Detection,"Deep learning methods can not only detect false data injection attacks (FDIA) +but also locate attacks of FDIA. Although adversarial false data injection +attacks (AFDIA) based on deep learning vulnerabilities have been studied in the +field of single-label FDIA detection, the adversarial attack and defense +against multi-label FDIA locational detection are still not involved. To bridge +this gap, this paper first explores the multi-label adversarial example attacks +against multi-label FDIA locational detectors and proposes a general +multi-label adversarial attack framework, namely muLti-labEl adverSarial falSe +data injectiON attack (LESSON). The proposed LESSON attack framework includes +three key designs, namely Perturbing State Variables, Tailored Loss Function +Design, and Change of Variables, which can help find suitable multi-label +adversarial perturbations within the physical constraints to circumvent both +Bad Data Detection (BDD) and Neural Attack Location (NAL). 
Four typical LESSON +attacks based on the proposed framework and two dimensions of attack objectives +are examined, and the experimental results demonstrate the effectiveness of the +proposed attack framework, posing serious and pressing security concerns in +smart grids.",cs.CR,['cs.CR'] +Towards Better Vision-Inspired Vision-Language Models,Yun-Hao Cao · Kaixiang Ji · Ziyuan Huang · Chuanyang Zheng · Jiajia Liu · Jian Wang · Jingdong Chen · Ming Yang, ,,https://www.youtube.com/watch?v=d91e0EwAIZc,,,,,nan +HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection,Qiming Xia · Wei Ye · Hai Wu · Shijia Zhao · Leyuan Xing · Xun Huang · Jinhao Deng · Xin Li · Chenglu Wen · Cheng Wang,https://github.com/xmuqimingxia/HINTED,https://arxiv.org/abs/2308.04556,,2308.04556.pdf,FocalFormer3D : Focusing on Hard Instance for 3D Object Detection,"False negatives (FN) in 3D object detection, {\em e.g.}, missing predictions +of pedestrians, vehicles, or other obstacles, can lead to potentially dangerous +situations in autonomous driving. While being fatal, this issue is understudied +in many current 3D detection methods. In this work, we propose Hard Instance +Probing (HIP), a general pipeline that identifies \textit{FN} in a multi-stage +manner and guides the models to focus on excavating difficult instances. For 3D +object detection, we instantiate this method as FocalFormer3D, a simple yet +effective detector that excels at excavating difficult objects and improving +prediction recall. FocalFormer3D features a multi-stage query generation to +discover hard objects and a box-level transformer decoder to efficiently +distinguish objects from massive object candidates. Experimental results on the +nuScenes and Waymo datasets validate the superior performance of FocalFormer3D. +The advantage leads to strong performance on both detection and tracking, in +both LiDAR and multi-modal settings. Notably, FocalFormer3D achieves a 70.5 mAP +and 73.9 NDS on nuScenes detection benchmark, while the nuScenes tracking +benchmark shows 72.1 AMOTA, both ranking 1st place on the nuScenes LiDAR +leaderboard. Our code is available at +\url{https://github.com/NVlabs/FocalFormer3D}.",cs.CV,['cs.CV'] +"DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations",Maximilian Augustin · Yannic Neuhaus · Matthias Hein, ,https://arxiv.org/abs/2311.17833,,2311.17833.pdf,"DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations","While deep learning has led to huge progress in complex image classification +tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call +into question how reliably these classifiers work in the wild. Furthermore, for +safety-critical tasks the black-box nature of their decisions is problematic, +and explanations or at least methods which make decisions plausible are needed +urgently. In this paper, we address these problems by generating images that +optimize a classifier-derived objective using a framework for guided image +generation. We analyze the decisions of image classifiers by visual +counterfactual explanations (VCEs), detection of systematic mistakes by +analyzing images where classifiers maximally disagree, and visualization of +neurons and spurious features. In this way, we validate existing observations, +e.g. 
the shape bias of adversarially robust models, as well as novel failure +modes, e.g. systematic errors of zero-shot CLIP classifiers. Moreover, our VCEs +outperform previous work while being more versatile.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models,Fei Deng · Qifei Wang · Wei Wei · Tingbo Hou · Matthias Grundmann, ,https://arxiv.org/abs/2402.08714,,2402.08714.pdf,PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models,"Reward finetuning has emerged as a promising approach to aligning foundation +models with downstream objectives. Remarkable success has been achieved in the +language domain by using reinforcement learning (RL) to maximize rewards that +reflect human preference. However, in the vision domain, existing RL-based +reward finetuning methods are limited by their instability in large-scale +training, rendering them incapable of generalizing to complex, unseen prompts. +In this paper, we propose Proximal Reward Difference Prediction (PRDP), +enabling stable black-box reward finetuning for diffusion models for the first +time on large-scale prompt datasets with over 100K prompts. Our key innovation +is the Reward Difference Prediction (RDP) objective that has the same optimal +solution as the RL objective while enjoying better training stability. +Specifically, the RDP objective is a supervised regression objective that tasks +the diffusion model with predicting the reward difference of generated image +pairs from their denoising trajectories. We theoretically prove that the +diffusion model that obtains perfect reward difference prediction is exactly +the maximizer of the RL objective. We further develop an online algorithm with +proximal updates to stably optimize the RDP objective. In experiments, we +demonstrate that PRDP can match the reward maximization ability of +well-established RL-based methods in small-scale training. Furthermore, through +large-scale training on text prompts from the Human Preference Dataset v2 and +the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a +diverse set of complex, unseen prompts whereas RL-based methods completely +fail.",cs.LG,"['cs.LG', 'cs.AI']" +SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology,Saarthak Kapse · Pushpak Pati · Srijan Das · Jingwei Zhang · Chao Chen · Maria Vakalopoulou · Joel Saltz · Dimitris Samaras · Rajarsi Gupta · Prateek Prasanna,https://github.com/bmi-imaginelab/SI-MIL,https://arxiv.org/abs/2312.15010,,2312.15010.pdf,SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology,"Introducing interpretability and reasoning into Multiple Instance Learning +(MIL) methods for Whole Slide Image (WSI) analysis is challenging, given the +complexity of gigapixel slides. Traditionally, MIL interpretability is limited +to identifying salient regions deemed pertinent for downstream tasks, offering +little insight to the end-user (pathologist) regarding the rationale behind +these selections. To address this, we propose Self-Interpretable MIL (SI-MIL), +a method intrinsically designed for interpretability from the very outset. +SI-MIL employs a deep MIL framework to guide an interpretable branch grounded +on handcrafted pathological features, facilitating linear predictions. Beyond +identifying salient regions, SI-MIL uniquely provides feature-level +interpretations rooted in pathological insights for WSIs. 
Notably, SI-MIL, with +its linear prediction constraints, challenges the prevalent myth of an +inevitable trade-off between model interpretability and performance, +demonstrating competitive results compared to state-of-the-art methods on +WSI-level prediction tasks across three cancer types. In addition, we +thoroughly benchmark the local and global-interpretability of SI-MIL in terms +of statistical analysis, a domain expert study, and desiderata of +interpretability, namely, user-friendliness and faithfulness.",cs.CV,['cs.CV'] +Diffusion Models Without Attention,Jing Nathan Yan · Jiatao Gu · Alexander Rush, ,,https://www.semanticscholar.org/paper/Diffusion-Models-Without-Attention-Yan-Gu/31245344a6eb6cd897a71928dc4b174ab75e4070,,,,,nan +DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models,Nastaran Saadati · Minh Pham · Nasla Saleem · Joshua R. Waite · Aditya Balu · Zhanhong Jiang · Chinmay Hegde · Soumik Sarkar, ,https://arxiv.org/abs/2404.08079,,2404.08079.pdf,DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models,"Recent advances in decentralized deep learning algorithms have demonstrated +cutting-edge performance on various tasks with large pre-trained models. +However, a pivotal prerequisite for achieving this level of competitiveness is +the significant communication and computation overheads when updating these +models, which prohibits the applications of them to real-world scenarios. To +address this issue, drawing inspiration from advanced model merging techniques +without requiring additional training, we introduce the Decentralized Iterative +Merging-And-Training (DIMAT) paradigm--a novel decentralized deep learning +framework. Within DIMAT, each agent is trained on their local data and +periodically merged with their neighboring agents using advanced model merging +techniques like activation matching until convergence is achieved. DIMAT +provably converges with the best available rate for nonconvex functions with +various first-order methods, while yielding tighter error bounds compared to +the popular existing approaches. We conduct a comprehensive empirical analysis +to validate DIMAT's superiority over baselines across diverse computer vision +tasks sourced from multiple datasets. Empirical results validate our +theoretical claims by showing that DIMAT attains faster and higher initial gain +in accuracy with independent and identically distributed (IID) and non-IID +data, incurring lower communication overhead. This DIMAT paradigm presents a +new opportunity for the future decentralized learning, enhancing its +adaptability to real-world with sparse and light-weight communication and +computation.",cs.LG,"['cs.LG', 'cs.CV', 'math.OC']" +Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problem,Haoquan Zhang · Ronggang Huang · Yi Xie · Huaidong Zhang, ,https://arxiv.org/abs/2310.05364,,2310.05364.pdf,Universal Multi-modal Entity Alignment via Iteratively Fusing Modality Similarity Paths,"The objective of Entity Alignment (EA) is to identify equivalent entity pairs +from multiple Knowledge Graphs (KGs) and create a more comprehensive and +unified KG. The majority of EA methods have primarily focused on the structural +modality of KGs, lacking exploration of multi-modal information. A few +multi-modal EA methods have made good attempts in this field. 
Still, they have +two shortcomings: (1) inconsistent and inefficient modality modeling that +designs complex and distinct models for each modality; (2) ineffective modality +fusion due to the heterogeneous nature of modalities in EA. To tackle these +challenges, we propose PathFusion, consisting of two main components: (1) MSP, +a unified modeling approach that simplifies the alignment process by +constructing paths connecting entities and modality nodes to represent multiple +modalities; (2) IRF, an iterative fusion method that effectively combines +information from different modalities using the path as an information carrier. +Experimental results on real-world datasets demonstrate the superiority of +PathFusion over state-of-the-art methods, with 22.4%-28.9% absolute improvement +on Hits@1, and 0.194-0.245 absolute improvement on MRR.",cs.CL,"['cs.CL', 'cs.AI']" +Hearing Anything Anywhere,Mason Wang · Ryosuke Sawata · Samuel Clarke · Ruohan Gao · Shangzhe Wu · Jiajun Wu, ,,https://zenodo.org/records/11195833,,,,,nan +OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees,Hakyeong Kim · Andreas Meuleman · Hyeonjoong Jang · James Tompkin · Min H. Kim,https://vclab.kaist.ac.kr/cvpr2024p2/index.html,https://arxiv.org/abs/2404.00678,,2404.00678.pdf,OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees,"We present a method to reconstruct indoor and outdoor static scene geometry +and appearance from an omnidirectional video moving in a small circular sweep. +This setting is challenging because of the small baseline and large depth +ranges, making it difficult to find ray crossings. To better constrain the +optimization, we estimate geometry as a signed distance field within a +spherical binoctree data structure and use a complementary efficient tree +traversal strategy based on a breadth-first search for sampling. Unlike regular +grids or trees, the shape of this structure well-matches the camera setting, +creating a better memory-quality trade-off. From an initial depth estimate, the +binoctree is adaptively subdivided throughout the optimization; previous +methods use a fixed depth that leaves the scene undersampled. In comparison +with three neural optimization methods and two non-neural methods, ours shows +decreased geometry error on average, especially in a detailed scene, while +significantly reducing the required number of voxels to represent such details.",cs.CV,"['cs.CV', 'cs.GR']" +Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective,Yu Mitsuzumi · Akisato Kimura · Hisashi Kashima, ,,https://akisatok.tech/news/a-paper-accepted-to-cvpr2024,,,,,nan +BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,Ahan Shabanov · Shrisudhan Govindarajan · Cody Reading · Leili Goli · Daniel Rebain · Kwang Moo Yi · Andrea Tagliasacchi, ,https://arxiv.org/abs/2404.13024,,2404.13024.pdf,BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,"Largely due to their implicit nature, neural fields lack a direct mechanism +for filtering, as Fourier analysis from discrete signal processing is not +directly applicable to these representations. Effective filtering of neural +fields is critical to enable level-of-detail processing in downstream +applications, and support operations that involve sampling the field on regular +grids (e.g. marching cubes). 
Existing methods that attempt to decompose neural +fields in the frequency domain either resort to heuristics or require extensive +modifications to the neural field architecture. We show that via a simple +modification, one can obtain neural fields that are low-pass filtered, and in +turn show how this can be exploited to obtain a frequency decomposition of the +entire signal. We demonstrate the validity of our technique by investigating +level-of-detail reconstruction, and showing how coarser representations can be +computed effectively.",cs.CV,"['cs.CV', 'eess.IV']" +PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution,Honghao Chen · Xiangxiang Chu · Renyongjian · Xin Zhao · Kaiqi Huang, ,https://arxiv.org/abs/2403.07589,,2403.07589.pdf,PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution,"Recently, some large kernel convnets strike back with appealing performance +and efficiency. However, given the square complexity of convolution, scaling up +kernels can bring about an enormous amount of parameters and the proliferated +parameters can induce severe optimization problem. Due to these issues, current +CNNs compromise to scale up to 51x51 in the form of stripe convolution (i.e., +51x5 + 5x51) and start to saturate as the kernel size continues growing. In +this paper, we delve into addressing these vital issues and explore whether we +can continue scaling up kernels for more performance gains. Inspired by human +vision, we propose a human-like peripheral convolution that efficiently reduces +over 90% parameter count of dense grid convolution through parameter sharing, +and manage to scale up kernel size to extremely large. Our peripheral +convolution behaves highly similar to human, reducing the complexity of +convolution from O(K^2) to O(logK) without backfiring performance. Built on +this, we propose Parameter-efficient Large Kernel Network (PeLK). Our PeLK +outperforms modern vision Transformers and ConvNet architectures like Swin, +ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet +classification, semantic segmentation on ADE20K and object detection on MS +COCO. For the first time, we successfully scale up the kernel size of CNNs to +an unprecedented 101x101 and demonstrate consistent improvements.",cs.CV,['cs.CV'] +Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos,Mehmet Saygin Seyfioglu · Wisdom Ikezogwo · Fatemeh Ghezloo · Ranjay Krishna · Linda Shapiro, ,https://arxiv.org/abs/2312.04746,,2312.04746.pdf,Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos,"Diagnosis in histopathology requires a global whole slide images (WSIs) +analysis, requiring pathologists to compound evidence from different WSI +patches. The gigapixel scale of WSIs poses a challenge for histopathology +multi-modal models. Training multi-model models for histopathology requires +instruction tuning datasets, which currently contain information for individual +image patches, without a spatial grounding of the concepts within each patch +and without a wider view of the WSI. Therefore, they lack sufficient diagnostic +capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a +large-scale dataset of 107,131 histopathology-specific instruction +question/answer pairs, grounded within diagnostically relevant image patches +that make up the WSI. 
Our dataset is collected by leveraging educational +histopathology videos from YouTube, which provides spatial localization of +narrations by automatically extracting the narrators' cursor positions. +Quilt-Instruct supports contextual reasoning by extracting diagnosis and +supporting facts from the entire WSI. Using Quilt-Instruct, we train +Quilt-LLaVA, which can reason beyond the given single image patch, enabling +diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a +comprehensive evaluation dataset created from 985 images and 1283 +human-generated question-answers. We also thoroughly evaluate Quilt-LLaVA using +public histopathology datasets, where Quilt-LLaVA significantly outperforms +SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set +VQA. Our code, data, and model are publicly accessible at +quilt-llava.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation,Javier Tirado-Garín · Javier Civera,https://github.com/javrtg/C2P,https://arxiv.org/abs/2312.05995,,2312.05995.pdf,From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation,"Estimating the relative camera pose from $n \geq 5$ correspondences between +two calibrated views is a fundamental task in computer vision. This process +typically involves two stages: 1) estimating the essential matrix between the +views, and 2) disambiguating among the four candidate relative poses that +satisfy the epipolar geometry. In this paper, we demonstrate a novel approach +that, for the first time, bypasses the second stage. Specifically, we show that +it is possible to directly estimate the correct relative camera pose from +correspondences without needing a post-processing step to enforce the +cheirality constraint on the correspondences. Building on recent advances in +certifiable non-minimal optimization, we frame the relative pose estimation as +a Quadratically Constrained Quadratic Program (QCQP). By applying the +appropriate constraints, we ensure the estimation of a camera pose that +corresponds to a valid 3D geometry and that is globally optimal when certified. +We validate our method through exhaustive synthetic and real-world experiments, +confirming the efficacy, efficiency and accuracy of the proposed approach. Code +is available at https://github.com/javrtg/C2P.",cs.CV,['cs.CV'] +Diffusion-based Blind Text Image Super-Resolution,Yuzhe Zhang · jiawei zhang · Hao Li · Zhouxia Wang · Luwei Hou · Dongqing Zou · Liheng Bian, ,https://arxiv.org/abs/2312.08886,,2312.08886.pdf,Diffusion-based Blind Text Image Super-Resolution,"Recovering degraded low-resolution text images is challenging, especially for +Chinese text images with complex strokes and severe degradation in real-world +scenarios. Ensuring both text fidelity and style realness is crucial for +high-quality text image super-resolution. Recently, diffusion models have +achieved great success in natural image synthesis and restoration due to their +powerful data distribution modeling abilities and data generation capabilities. +In this work, we propose an Image Diffusion Model (IDM) to restore text images +with realistic styles. For diffusion models, they are not only suitable for +modeling realistic image distribution but also appropriate for learning text +distribution. 
Since text prior is important to guarantee the correctness of the +restored text structure according to existing arts, we also propose a Text +Diffusion Model (TDM) for text recognition which can guide IDM to generate text +images with correct structures. We further propose a Mixture of Multi-modality +module (MoM) to make these two diffusion models cooperate with each other in +all the diffusion steps. Extensive experiments on synthetic and real-world +datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution +(DiffTSR) can restore text images with more accurate text structures as well as +more realistic appearances simultaneously.",cs.CV,['cs.CV'] +Language-driven Grasp Detection,An Dinh Vuong · Minh Nhat VU · Baoru Huang · Nghia Nguyen · Hieu Le · Thieu Vo · Thieu Vo · Anh Nguyen,https://airvlab.github.io/grasp-anything/,https://ar5iv.labs.arxiv.org/html/2309.09818,,2309.09818.pdf,Grasp-Anything: Large-scale Grasp Dataset from Foundation Models,"Foundation models such as ChatGPT have made significant strides in robotic +tasks due to their universal representation of real-world domains. In this +paper, we leverage foundation models to tackle grasp detection, a persistent +challenge in robotics with broad industrial applications. Despite numerous +grasp datasets, their object diversity remains limited compared to real-world +figures. Fortunately, foundation models possess an extensive repository of +real-world knowledge, including objects we encounter in our daily lives. As a +consequence, a promising solution to the limited representation in previous +grasp datasets is to harness the universal knowledge embedded in these +foundation models. We present Grasp-Anything, a new large-scale grasp dataset +synthesized from foundation models to implement this solution. Grasp-Anything +excels in diversity and magnitude, boasting 1M samples with text descriptions +and more than 3M objects, surpassing prior datasets. Empirically, we show that +Grasp-Anything successfully facilitates zero-shot grasp detection on +vision-based tasks and real-world robotic experiments. Our dataset and code are +available at https://grasp-anything-2023.github.io.",cs.RO,"['cs.RO', 'cs.CV']" +Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,Shengbang Tong · Zhuang Liu · Zhuang Liu · Yuexiang Zhai · Yi Ma · Yann LeCun · Saining Xie, ,http://export.arxiv.org/abs/2401.06209,,2401.06209.pdf,Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,"Is vision good enough for language? Recent advancements in multimodal models +primarily stem from the powerful reasoning abilities of large language models +(LLMs). However, the visual component typically depends only on the +instance-level contrastive language-image pre-training (CLIP). Our research +reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still +exhibit systematic shortcomings. To understand the roots of these errors, we +explore the gap between the visual embedding space of CLIP and vision-only +self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP +perceives as similar despite their clear visual differences. With these pairs, +we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes +areas where state-of-the-art systems, including GPT-4V, struggle with +straightforward questions across nine basic visual patterns, often providing +incorrect answers and hallucinated explanations. 
We further evaluate various +CLIP-based vision-and-language models and found a notable correlation between +visual patterns that challenge CLIP models and those problematic for multimodal +LLMs. As an initial effort to address these issues, we propose a Mixture of +Features (MoF) approach, demonstrating that integrating vision self-supervised +learning features with MLLMs can significantly enhance their visual grounding +capabilities. Together, our research suggests visual representation learning +remains an open challenge, and accurate visual grounding is crucial for future +successful multimodal systems.",cs.CV,['cs.CV'] +Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods,Mengyu Dai · Amir Hossein Raffiee · Aashish Jain · Joshua Correa, ,,https://ieeexplore.ieee.org/document/10452779,,,,,nan +Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields,Joshua Ahn · Haochen Wang · Raymond A. Yeh · Greg Shakhnarovich,https://pals.ttic.edu/p/alpha-invariance,https://arxiv.org/abs/2404.02155,,2404.02155.pdf,Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields,"Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of +volumetric densities in neural radiance fields, i.e., the densities double when +scene size is halved, and vice versa. We call this property alpha invariance. +For NeRFs to better maintain alpha invariance, we recommend 1) parameterizing +both distance and volume densities in log space, and 2) a +discretization-agnostic initialization strategy to guarantee high ray +transmittance. We revisit a few popular radiance field models and find that +these systems use various heuristics to deal with issues arising from scene +scaling. We test their behaviors and show our recipe to be more robust.",cs.CV,['cs.CV'] +Prompt-Driven Referring Image Segmentation with Instance Contrasting,Chao Shang · Zichen Song · Heqian Qiu · Lanxiao Wang · Fanman Meng · Hongliang Li, ,https://arxiv.org/abs/2310.19721,,2310.19721.pdf,Promise:Prompt-driven 3D Medical Image Segmentation Using Pretrained Image Foundation Models,"To address prevalent issues in medical imaging, such as data acquisition +challenges and label availability, transfer learning from natural to medical +image domains serves as a viable strategy to produce reliable segmentation +results. However, several existing barriers between domains need to be broken +down, including addressing contrast discrepancies, managing anatomical +variability, and adapting 2D pretrained models for 3D segmentation tasks. In +this paper, we propose ProMISe,a prompt-driven 3D medical image segmentation +model using only a single point prompt to leverage knowledge from a pretrained +2D image foundation model. In particular, we use the pretrained vision +transformer from the Segment Anything Model (SAM) and integrate lightweight +adapters to extract depth-related (3D) spatial context without updating the +pretrained weights. For robust results, a hybrid network with complementary +encoders is designed, and a boundary-aware loss is proposed to achieve precise +boundaries. We evaluate our model on two public datasets for colon and pancreas +tumor segmentations, respectively. Compared to the state-of-the-art +segmentation methods with and without prompt engineering, our proposed method +achieves superior performance. 
The code is publicly available at +https://github.com/MedICL-VU/ProMISe.",eess.IV,"['eess.IV', 'cs.CV']" +DreamVideo: Composing Your Dream Videos with Customized Subject and Motion,Yujie Wei · Shiwei Zhang · Zhiwu Qing · Hangjie Yuan · Zhiheng Liu · Yu Liu · Yingya Zhang · Jingren Zhou · Hongming Shan,https://dreamvideo-t2v.github.io/,https://arxiv.org/abs/2312.04433,,2312.04433.pdf,DreamVideo: Composing Your Dream Videos with Customized Subject and Motion,"Customized generation using diffusion models has made impressive progress in +image generation, but remains unsatisfactory in the challenging video +generation task, as it requires the controllability of both subjects and +motions. To that end, we present DreamVideo, a novel approach to generating +personalized videos from a few static images of the desired subject and a few +videos of target motion. DreamVideo decouples this task into two stages, +subject learning and motion learning, by leveraging a pre-trained video +diffusion model. The subject learning aims to accurately capture the fine +appearance of the subject from provided images, which is achieved by combining +textual inversion and fine-tuning of our carefully designed identity adapter. +In motion learning, we architect a motion adapter and fine-tune it on the given +videos to effectively model the target motion pattern. Combining these two +lightweight and efficient adapters allows for flexible customization of any +subject with any motion. Extensive experimental results demonstrate the +superior performance of our DreamVideo over the state-of-the-art methods for +customized video generation. Our project page is at +https://dreamvideo-t2v.github.io.",cs.CV,['cs.CV'] +Multi-Attribute Interactions Matter for 3D Visual Grounding,Can Xu · Yuehui Han · Rui Xu · Le Hui · Jin Xie · Jian Yang, ,https://arxiv.org/abs/2404.19696,,2404.19696.pdf,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. 
Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering,Vivek Gopalakrishnan · Neel Dey · Polina Golland, ,https://arxiv.org/abs/2312.06358,,2312.06358.pdf,Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering,"Surgical decisions are informed by aligning rapid portable 2D intraoperative +images (e.g., X-rays) to a high-fidelity 3D preoperative reference scan (e.g., +CT). 2D/3D image registration often fails in practice: conventional +optimization methods are prohibitively slow and susceptible to local minima, +while neural networks trained on small datasets fail on new patients or require +impractical landmark supervision. We present DiffPose, a self-supervised +approach that leverages patient-specific simulation and differentiable +physics-based rendering to achieve accurate 2D/3D registration without relying +on manually labeled data. Preoperatively, a CNN is trained to regress the pose +of a randomly oriented synthetic X-ray rendered from the preoperative CT. The +CNN then initializes rapid intraoperative test-time optimization that uses the +differentiable X-ray renderer to refine the solution. Our work further proposes +several geometrically principled methods for sampling camera poses from +$\mathbf{SE}(3)$, for sparse differentiable rendering, and for driving +registration in the tangent space $\mathfrak{se}(3)$ with geodesic and +multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy +across surgical datasets at intraoperative speeds, improving upon existing +unsupervised methods by an order of magnitude and even outperforming supervised +baselines. Our code is available at https://github.com/eigenvivek/DiffPose.",cs.CV,['cs.CV'] +MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant,Chenlu Zhan · Gaoang Wang · Yu LIN · Hongwei Wang · Jian Wu, ,https://arxiv.org/abs/2403.04290,,2403.04290.pdf,MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant,"Medical generative models, acknowledged for their high-quality sample +generation ability, have accelerated the fast growth of medical applications. +However, recent works concentrate on separate medical generation models for +distinct medical tasks and are restricted to inadequate medical multi-modal +knowledge, constraining medical comprehensive diagnosis. In this paper, we +propose MedM2G, a Medical Multi-Modal Generative framework, with the key +innovation to align, extract, and generate medical multi-modal within a unified +model. Extending beyond single or two medical modalities, we efficiently align +medical multi-modal through the central alignment approach in the unified +space. Significantly, our framework extracts valuable clinical knowledge by +preserving the medical visual invariant of each imaging modal, thereby +enhancing specific medical information for multi-modal generation. By +conditioning the adaptive cross-guided parameters into the multi-flow diffusion +framework, our model promotes flexible interactions among medical multi-modal +for generation. MedM2G is the first medical generative model that unifies +medical generation tasks of text-to-image, image-to-text, and unified +generation of medical modalities (CT, MRI, X-ray). 
It performs 5 medical +generation tasks across 10 datasets, consistently outperforming various +state-of-the-art works.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +SeD: Semantic-Aware Discriminator for Image Super-Resolution,Bingchen Li · Xin Li · Hanxin Zhu · YEYING JIN · Ruoyu Feng · Zhizheng Zhang · Zhibo Chen, ,https://arxiv.org/abs/2402.19387,,2402.19387.pdf,SeD: Semantic-Aware Discriminator for Image Super-Resolution,"Generative Adversarial Networks (GANs) have been widely used to recover vivid +textures in image super-resolution (SR) tasks. In particular, one discriminator +is utilized to enable the SR network to learn the distribution of real-world +high-quality images in an adversarial training manner. However, the +distribution learning is overly coarse-grained, which is susceptible to virtual +textures and causes counter-intuitive generation results. To mitigate this, we +propose the simple and effective Semantic-aware Discriminator (denoted as SeD), +which encourages the SR network to learn the fine-grained distributions by +introducing the semantics of images as a condition. Concretely, we aim to +excavate the semantics of images from a well-trained semantic extractor. Under +different semantics, the discriminator is able to distinguish the real-fake +images individually and adaptively, which guides the SR network to learn the +more fine-grained semantic-aware textures. To obtain accurate and abundant +semantics, we take full advantage of recently popular pretrained vision models +(PVMs) with extensive datasets, and then incorporate its semantic features into +the discriminator through a well-designed spatial cross-attention module. In +this way, our proposed semantic-aware discriminator empowered the SR network to +produce more photo-realistic and pleasing images. Extensive experiments on two +typical tasks, i.e., SR and Real SR have demonstrated the effectiveness of our +proposed methods.",eess.IV,"['eess.IV', 'cs.CV']" +Taming Self-Training for Open-Vocabulary Object Detection,Shiyu Zhao · Samuel Schulter · Long Zhao · Zhixing Zhang · Vijay Kumar BG · Yumin Suh · Manmohan Chandraker · Dimitris N. Metaxas, ,https://arxiv.org/abs/2308.06412,,2308.06412.pdf,Taming Self-Training for Open-Vocabulary Object Detection,"Recent studies have shown promising performance in open-vocabulary object +detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and +language models (VLMs). However, teacher-student self-training, a powerful and +widely used paradigm to leverage PLs, is rarely explored for OVD. This work +identifies two challenges of using self-training in OVD: noisy PLs from VLMs +and frequent distribution changes of PLs. To address these challenges, we +propose SAS-Det that tames self-training for OVD from two key perspectives. +First, we present a split-and-fusion (SAF) head that splits a standard +detection into an open-branch and a closed-branch. This design can reduce noisy +supervision from pseudo boxes. Moreover, the two branches learn complementary +knowledge from different training data, significantly enhancing performance +when fused together. Second, in our view, unlike in closed-set tasks, the PL +distributions in OVD are solely determined by the teacher model. We introduce a +periodic update strategy to decrease the number of updates to the teacher, +thereby decreasing the frequency of changes in PL distributions, which +stabilizes the training process. Extensive experiments demonstrate SAS-Det is +both efficient and effective. 
SAS-Det outperforms recent models of the same +scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories +of the COCO and LVIS benchmarks, respectively. Code is available at +\url{https://github.com/xiaofeng94/SAS-Det}.",cs.CV,['cs.CV'] +Edit One for All: Interactive Batch Image Editing,Thao Nguyen · Utkarsh Ojha · Yuheng Li · Haotian Liu · Yong Jae Lee,https://thaoshibe.github.io/edit-one-for-all,https://arxiv.org/abs/2401.10219,,2401.10219.pdf,Edit One for All: Interactive Batch Image Editing,"In recent years, image editing has advanced remarkably. With increased human +control, it is now possible to edit an image in a plethora of ways; from +specifying in text what we want to change, to straight up dragging the contents +of the image in an interactive point-based manner. However, most of the focus +has remained on editing single images at a time. Whether and how we can +simultaneously edit large batches of images has remained understudied. With the +goal of minimizing human supervision in the editing process, this paper +presents a novel method for interactive batch image editing using StyleGAN as +the medium. Given an edit specified by users in an example image (e.g., make +the face frontal), our method can automatically transfer that edit to other +test images, so that regardless of their initial state (pose), they all arrive +at the same final state (e.g., all facing front). Extensive experiments +demonstrate that edits performed using our method have similar visual quality +to existing single-image-editing methods, while having more visual consistency +and saving significant time and human effort.",cs.CV,['cs.CV'] +Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning,Desai Xie · Jiahao Li · Hao Tan · Xin Sun · Zhixin Shu · Yi Zhou · Sai Bi · Soren Pirk · Soeren Pirk · ARIE KAUFMAN,https://desaixie.github.io/carve-3d/,https://arxiv.org/abs/2312.13980v1,,2312.13980v1.pdf,Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning,"Recent advancements in the text-to-3D task leverage finetuned text-to-image +diffusion models to generate multi-view images, followed by NeRF +reconstruction. Yet, existing supervised finetuned (SFT) diffusion models still +suffer from multi-view inconsistency and the resulting NeRF artifacts. Although +training longer with SFT improves consistency, it also causes distribution +shift, which reduces diversity and realistic details. We argue that the SFT of +multi-view diffusion models resembles the instruction finetuning stage of the +LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods. +Essentially, RLFT methods optimize models beyond their SFT data distribution by +using their own outputs, effectively mitigating distribution shift. To this +end, we introduce Carve3D, a RLFT method coupled with the Multi-view +Reconstruction Consistency (MRC) metric, to improve the consistency of +multi-view diffusion models. To compute MRC on a set of multi-view images, we +compare them with their corresponding renderings of the reconstructed NeRF at +the same viewpoints. We validate the robustness of MRC with extensive +experiments conducted under controlled inconsistency levels. We enhance the +base RLFT algorithm to stabilize the training process, reduce distribution +shift, and identify scaling laws. 
Through qualitative and quantitative +experiments, along with a user study, we demonstrate Carve3D's improved +multi-view consistency, the resulting superior NeRF reconstruction quality, and +minimal distribution shift compared to longer SFT. Project webpage: +https://desaixie.github.io/carve-3d.",cs.CV,"['cs.CV', 'cs.LG']" +Density-Guided Semi-Supervised 3D Semantic Segmentation with Dual-Space Hardness Sampling,Jianan Li · Qiulei Dong, ,https://arxiv.org/abs/2306.08045,,2306.08045.pdf,Efficient 3D Semantic Segmentation with Superpoint Transformer,"We introduce a novel superpoint-based transformer architecture for efficient +semantic segmentation of large-scale 3D scenes. Our method incorporates a fast +algorithm to partition point clouds into a hierarchical superpoint structure, +which makes our preprocessing 7 times faster than existing superpoint-based +approaches. Additionally, we leverage a self-attention mechanism to capture the +relationships between superpoints at multiple scales, leading to +state-of-the-art performance on three challenging benchmark datasets: S3DIS +(76.0% mIoU 6-fold validation), KITTI-360 (63.5% on Val), and DALES (79.6%). +With only 212k parameters, our approach is up to 200 times more compact than +other state-of-the-art models while maintaining similar performance. +Furthermore, our model can be trained on a single GPU in 3 hours for a fold of +the S3DIS dataset, which is 7x to 70x fewer GPU-hours than the best-performing +methods. Our code and models are accessible at +github.com/drprojects/superpoint_transformer.",cs.CV,['cs.CV'] +Unifying Automatic and Interactive Matting with Pretrained ViTs,Zixuan Ye · Wenze Liu · He Guo · Yujia Liang · Chaoyi Hong · Hao Lu · Zhiguo Cao, ,,https://dl.acm.org/doi/10.1016/j.inffus.2023.102091,,,,,nan +S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes,Xingyi Li · Zhiguo Cao · Yizheng Wu · Kewei Wang · Ke Xian · Zhe Wang · Guosheng Lin, ,https://arxiv.org/abs/2403.06205,,2403.06205.pdf,S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes,"Current 3D stylization methods often assume static scenes, which violates the +dynamic nature of our real world. To address this limitation, we present +S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural +radiance fields. However, stylizing dynamic 3D scenes is inherently challenging +due to the limited availability of stylized reference images along the temporal +axis. Our key insight lies in introducing additional temporal cues besides the +provided reference. To this end, we generate temporal pseudo-references from +the given stylized reference. These pseudo-references facilitate the +propagation of style information from the reference to the entire dynamic 3D +scene. For coarse style transfer, we enforce novel views and times to mimic the +style details present in pseudo-references at the feature level. To preserve +high-frequency details, we create a collection of stylized temporal pseudo-rays +from temporal pseudo-references. These pseudo-rays serve as detailed and +explicit stylization guidance for achieving fine style transfer. 
Experiments on +both synthetic and real-world datasets demonstrate that our method yields +plausible stylized results of space-time view synthesis on dynamic 3D scenes.",cs.CV,['cs.CV'] +Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis,Simon Niedermayr · Josef Stumpfegger · rüdiger westermann,https://keksboter.github.io/c3dgs/,https://arxiv.org/abs/2401.02436,,2401.02436.pdf,Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis,"Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian +splat representation has been introduced for novel view synthesis from sparse +image sets. Making such representations suitable for applications like network +streaming and rendering on low-power devices requires significantly reduced +memory consumption as well as improved rendering efficiency. We propose a +compressed 3D Gaussian splat representation that utilizes sensitivity-aware +vector clustering with quantization-aware training to compress directional +colors and Gaussian parameters. The learned codebooks have low bitrates and +achieve a compression rate of up to $31\times$ on real-world scenes with only +minimal degradation of visual quality. We demonstrate that the compressed splat +representation can be efficiently rendered with hardware rasterization on +lightweight GPUs at up to $4\times$ higher framerates than reported via an +optimized GPU compute pipeline. Extensive experiments across multiple datasets +demonstrate the robustness and rendering speed of the proposed approach.",cs.CV,"['cs.CV', 'cs.GR']" +ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images,Nicolas Bourriez · Ihab Bendidi · Cohen Ethan · Gabriel Watkinson · Maxime Sanchez · Guillaume Bollot · Auguste Genovesio, ,https://arxiv.org/abs/2311.15264,,2311.15264.pdf,ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images,"Unlike color photography images, which are consistently encoded into RGB +channels, biological images encompass various modalities, where the type of +microscopy and the meaning of each channel varies with each experiment. +Importantly, the number of channels can range from one to a dozen and their +correlation is often comparatively much lower than RGB, as each of them brings +specific information content. This aspect is largely overlooked by methods +designed out of the bioimage field, and current solutions mostly focus on +intra-channel spatial attention, often ignoring the relationship between +channels, yet crucial in most biological applications. Importantly, the +variable channel type and count prevent the projection of several experiments +to a unified representation for large scale pre-training. In this study, we +propose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture +employing an Inter-Channel Attention mechanism on images with an arbitrary +number, order and type of channels. We also introduce IDRCell100k, a bioimage +dataset with a rich set of 79 experiments covering 7 microscope modalities, +with a multitude of channel types, and counts varying from 1 to 10 per +experiment. Our architecture, trained in a self-supervised manner, outperforms +existing approaches in several biologically relevant downstream tasks. 
+Additionally, it can be used to bridge the gap for the first time between +assays with different microscopes, channel numbers or types by embedding +various image and experimental modalities into a unified biological image +representation. The latter should facilitate interdisciplinary studies and pave +the way for better adoption of deep learning in biological image-based +analyses. Code and Data available at https://github.com/nicoboou/chadavit.",cs.CV,"['cs.CV', 'cs.LG']" +Generating Enhanced Negatives for Training Language-Based Object Detectors,Shiyu Zhao · Long Zhao · Vijay Kumar BG · Yumin Suh · Dimitris N. Metaxas · Manmohan Chandraker · Samuel Schulter, ,https://arxiv.org/abs/2401.00094,,2401.00094.pdf,Generating Enhanced Negatives for Training Language-Based Object Detectors,"The recent progress in language-based open-vocabulary object detection can be +largely attributed to finding better ways of leveraging large-scale data with +free-form text annotations. Training such models with a discriminative +objective function has proven successful, but requires good positive and +negative samples. However, the free-form nature and the open vocabulary of +object descriptions make the space of negatives extremely large. Prior works +randomly sample negatives or use rule-based techniques to build them. In +contrast, we propose to leverage the vast knowledge built into modern +generative models to automatically build negatives that are more relevant to +the original data. Specifically, we use large-language-models to generate +negative text descriptions, and text-to-image diffusion models to also generate +corresponding negative images. Our experimental analysis confirms the relevance +of the generated negative data, and its use in language-based detectors +improves performance on two complex benchmarks. Code is available at +\url{https://github.com/xiaofeng94/Gen-Enhanced-Negs}.",cs.CV,['cs.CV'] +Named Entity Driven Zero-Shot Image Manipulation,Zhida Feng · Li Chen · Jing Tian · Jiaxiang Liu · Shikun Feng,https://github.com/feng-zhida/StyleEntity,https://arxiv.org/abs/2307.13497,,2307.13497.pdf,Zshot: An Open-source Framework for Zero-Shot Named Entity Recognition and Relation Extraction,"The Zero-Shot Learning (ZSL) task pertains to the identification of entities +or relations in texts that were not seen during training. ZSL has emerged as a +critical research area due to the scarcity of labeled data in specific domains, +and its applications have grown significantly in recent years. With the advent +of large pretrained language models, several novel methods have been proposed, +resulting in substantial improvements in ZSL performance. There is a growing +demand, both in the research community and industry, for a comprehensive ZSL +framework that facilitates the development and accessibility of the latest +methods and pretrained models.In this study, we propose a novel ZSL framework +called Zshot that aims to address the aforementioned challenges. Our primary +objective is to provide a platform that allows researchers to compare different +state-of-the-art ZSL methods with standard benchmark datasets. Additionally, we +have designed our framework to support the industry with readily available APIs +for production under the standard SpaCy NLP pipeline. 
Our API is extendible and +evaluable, moreover, we include numerous enhancements such as boosting the +accuracy with pipeline ensembling and visualization utilities available as a +SpaCy extension.",cs.CL,"['cs.CL', 'cs.AI', 'cs.LG']" +Learned Scanpaths Aid Blind Panoramic Video Quality Assessment,Kanglong FAN · Wen Wen · Mu Li · YIFAN PENG · Kede Ma,https://github.com/kalofan/AutoScanpathQA,https://arxiv.org/abs/2404.00252,,2404.00252.pdf,Learned Scanpaths Aid Blind Panoramic Video Quality Assessment,"Panoramic videos have the advantage of providing an immersive and interactive +viewing experience. Nevertheless, their spherical nature gives rise to various +and uncertain user viewing behaviors, which poses significant challenges for +panoramic video quality assessment (PVQA). In this work, we propose an +end-to-end optimized, blind PVQA method with explicit modeling of user viewing +patterns through visual scanpaths. Our method consists of two modules: a +scanpath generator and a quality assessor. The scanpath generator is initially +trained to predict future scanpaths by minimizing their expected code length +and then jointly optimized with the quality assessor for quality prediction. +Our blind PVQA method enables direct quality assessment of panoramic images by +treating them as videos composed of identical frames. Experiments on three +public panoramic image and video quality datasets, encompassing both synthetic +and authentic distortions, validate the superiority of our blind PVQA model +over existing methods.",eess.IV,"['eess.IV', 'cs.CV']" +Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision,Xin Juan · Kaixiong Zhou · Ninghao Liu · Tianlong Chen · Xin Wang, ,https://arxiv.org/abs/2309.05203,,2309.05203.pdf,From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery,"Molecule discovery serves as a cornerstone in numerous scientific domains, +fueling the development of new materials and innovative drug designs. Recent +developments of in-silico molecule discovery have highlighted the promising +results of cross-modal techniques, which bridge molecular structures with their +descriptive annotations. However, these cross-modal methods frequently +encounter the issue of data scarcity, hampering their performance and +application. In this paper, we address the low-resource challenge by utilizing +artificially-real data generated by Large Language Models (LLMs). We first +introduce a retrieval-based prompting strategy to construct high-quality pseudo +data, then explore the optimal method to effectively leverage this pseudo data. +Experiments show that using pseudo data for domain adaptation outperforms all +existing methods, while also requiring a smaller model scale, reduced data size +and lower training cost, highlighting its efficiency. Furthermore, our method +shows a sustained improvement as the volume of pseudo data increases, revealing +the great potential of pseudo data in advancing low-resource cross-modal +molecule discovery. 
Our code and data are available at +https://github.com/SCIR-HI/ArtificiallyR2R.",cs.CL,['cs.CL'] +ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,Xiaoqi Li · Mingxu Zhang · Yiran Geng · Haoran Geng · Haoran Geng · Yuxing Long · Yan Shen · Renrui Zhang · Jiaming Liu · Hao Dong,https://sites.google.com/view/manipllm,https://arxiv.org/abs/2312.16217,,2312.16217.pdf,ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,"Robot manipulation relies on accurately predicting contact points and +end-effector directions to ensure successful operation. However, learning-based +robot manipulation, trained on a limited category within a simulator, often +struggles to achieve generalizability, especially when confronted with +extensive categories. Therefore, we introduce an innovative approach for robot +manipulation that leverages the robust reasoning capabilities of Multimodal +Large Language Models (MLLMs) to enhance the stability and generalization of +manipulation. By fine-tuning the injected adapters, we preserve the inherent +common sense and reasoning ability of the MLLMs while equipping them with the +ability for manipulation. The fundamental insight lies in the introduced +fine-tuning paradigm, encompassing object category understanding, affordance +prior reasoning, and object-centric pose prediction to stimulate the reasoning +ability of MLLM in manipulation. During inference, our approach utilizes an RGB +image and text prompt to predict the end effector's pose in chain of thoughts. +After the initial contact is established, an active impedance adaptation policy +is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, +in real world, we design a test-time adaptation (TTA) strategy for manipulation +to enable the model better adapt to the current real-world scene configuration. +Experiments in simulator and real-world show the promising performance of +ManipLLM. More details and demonstrations can be found at +https://sites.google.com/view/manipllm.",cs.CV,"['cs.CV', 'cs.RO']" +Consistent Prompting for Rehearsal-Free Continual Learning,Zhanxin Gao · Jun Cen · Xiaobin Chang,https://github.com/Zhanxin-Gao/CPrompt,https://arxiv.org/abs/2403.08568,,2403.08568.pdf,Consistent Prompting for Rehearsal-Free Continual Learning,"Continual learning empowers models to adapt autonomously to the ever-changing +environment or data streams without forgetting old knowledge. Prompt-based +approaches are built on frozen pre-trained models to learn the task-specific +prompts and classifiers efficiently. Existing prompt-based methods are +inconsistent between training and testing, limiting their effectiveness. Two +types of inconsistency are revealed. Test predictions are made from all +classifiers while training only focuses on the current task classifier without +holistic alignment, leading to Classifier inconsistency. Prompt inconsistency +indicates that the prompt selected during testing may not correspond to the one +associated with this task during training. In this paper, we propose a novel +prompt-based method, Consistent Prompting (CPrompt), for more aligned training +and testing. Specifically, all existing classifiers are exposed to prompt +training, resulting in classifier consistency learning. In addition, prompt +consistency learning is proposed to enhance prediction robustness and boost +prompt selection accuracy. 
Our Consistent Prompting surpasses its prompt-based +counterparts and achieves state-of-the-art performance on multiple continual +learning benchmarks. Detailed analysis shows that improvements come from more +consistent training and testing.",cs.CV,"['cs.CV', 'cs.LG']" +Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration,Yixuan Sun · Zhangyue Yin · Haibo Wang · Yan Wang · Xipeng Qiu · Weifeng Ge · Wenqiang Zhang, ,https://ar5iv.labs.arxiv.org/html/2401.11739,,2401.11739.pdf,EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models,"Diffusion models have recently received increasing research attention for +their remarkable transfer abilities in semantic segmentation tasks. However, +generating fine-grained segmentation masks with diffusion models often requires +additional training on annotated datasets, leaving it unclear to what extent +pre-trained diffusion models alone understand the semantic relations of their +generated images. To address this question, we leverage the semantic knowledge +extracted from Stable Diffusion (SD) and aim to develop an image segmentor +capable of generating fine-grained segmentation maps without any additional +training. The primary difficulty stems from the fact that semantically +meaningful feature maps typically exist only in the spatially lower-dimensional +layers, which poses a challenge in directly extracting pixel-level semantic +relations from these feature maps. To overcome this issue, our framework +identifies semantic correspondences between image pixels and spatial locations +of low-dimensional feature maps by exploiting SD's generation process and +utilizes them for constructing image-resolution segmentation maps. In extensive +experiments, the produced segmentation maps are demonstrated to be well +delineated and capture detailed parts of the images, indicating the existence +of highly accurate pixel-level semantic knowledge in diffusion models.",cs.CV,"['cs.CV', 'cs.LG']" +Model Adaptation for Time Constrained Embodied Control,Jaehyun Song · Minjong Yoo · Honguk Woo, ,,https://ieeexplore.ieee.org/document/10510652,,,,,nan +360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries,Huajian Huang · Changkun Liu · Yipeng Zhu · Hui Cheng · Tristan Braud · Sai-Kit Yeung, ,https://arxiv.org/abs/2311.17389,,2311.17389.pdf,360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries,"Portable 360$^\circ$ cameras are becoming a cheap and efficient tool to +establish large visual databases. By capturing omnidirectional views of a +scene, these cameras could expedite building environment models that are +essential for visual localization. However, such an advantage is often +overlooked due to the lack of valuable datasets. This paper introduces a new +benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth +poses for visual localization. We present a practical implementation of +360$^\circ$ mapping combining 360$^\circ$ images with lidar data to generate +the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that +explores the challenge of cross-device visual positioning, involving +360$^\circ$ reference frames, and query frames from pinhole, ultra-wide FoV +fisheye, and 360$^\circ$ cameras. 
We propose a virtual camera approach to +generate lower-FoV query frames from 360$^\circ$ images, which ensures a fair +comparison of performance among different query types in visual localization +tasks. We also extend this virtual camera approach to feature matching-based +and pose regression-based methods to alleviate the performance loss caused by +the cross-device domain gap, and evaluate its effectiveness against +state-of-the-art baselines. We demonstrate that omnidirectional visual +localization is more robust in challenging large-scale scenes with symmetries +and repetitive structures. These results provide new insights into 360-camera +mapping and omnidirectional visual localization with cross-device queries.",cs.CV,['cs.CV'] +Layout-Agnostic Scene Text Image Synthesis with Diffusion Models,Qilong Zhangli · Jindong Jiang · Di Liu · Licheng Yu · Xiaoliang Dai · Ankit Ramchandani · Guan Pang · Dimitris N. Metaxas · Praveen Krishnan, ,https://arxiv.org/abs/2312.04884,,2312.04884.pdf,UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models,"Text-to-Image (T2I) generation methods based on diffusion model have garnered +significant attention in the last few years. Although these image synthesis +methods produce visually appealing results, they frequently exhibit spelling +errors when rendering text within the generated images. Such errors manifest as +missing, incorrect or extraneous characters, thereby severely constraining the +performance of text image generation based on diffusion models. To address the +aforementioned issue, this paper proposes a novel approach for text image +generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion +[27]). Our approach involves the design and training of a light-weight +character-level text encoder, which replaces the original CLIP encoder and +provides more robust text embeddings as conditional guidance. Then, we +fine-tune the diffusion model using a large-scale dataset, incorporating local +attention control under the supervision of character-level segmentation maps. +Finally, by employing an inference stage refinement process, we achieve a +notably high sequence accuracy when synthesizing text in arbitrarily given +images. Both qualitative and quantitative results demonstrate the superiority +of our method to the state of the art. Furthermore, we showcase several +potential applications of the proposed UDiffText, including text-centric image +synthesis, scene text editing, etc. Code and model will be available at +https://github.com/ZYM-PKU/UDiffText .",cs.CV,['cs.CV'] +Amodal Completion via Progressive Mixed Context Diffusion,Katherine Xu · Lingzhi Zhang · Jianbo Shi,https://k8xu.github.io/amodal,https://arxiv.org/abs/2312.15540,,2312.15540.pdf,Amodal Completion via Progressive Mixed Context Diffusion,"Our brain can effortlessly recognize objects even when partially hidden from +view. Seeing the visible of the hidden is called amodal completion; however, +this task remains a challenge for generative AI despite rapid progress. We +propose to sidestep many of the difficulties of existing approaches, which +typically involve a two-step process of predicting amodal masks and then +generating pixels. Our method involves thinking outside the box, literally! We +go outside the object bounding box to use its context to guide a pre-trained +diffusion inpainting model, and then progressively grow the occluded object and +trim the extra background. 
We overcome two technical challenges: 1) how to be +free of unwanted co-occurrence bias, which tends to regenerate similar +occluders, and 2) how to judge if an amodal completion has succeeded. Our +amodal completion method exhibits improved photorealistic completion results +compared to existing approaches in numerous successful completion cases. And +the best part? It doesn't require any special training or fine-tuning of +models.",cs.CV,['cs.CV'] +Make Pixels Dance: High-Dynamic Video Generation,Yan Zeng · Guoqiang Wei · Jiani Zheng · Jiaxin Zou · Yang Wei · Yuchen Zhang · Yuchen Zhang · Hang Li, ,https://arxiv.org/abs/2311.10982,,2311.10982.pdf,Make Pixels Dance: High-Dynamic Video Generation,"Creating high-dynamic videos such as motion-rich actions and sophisticated +visual effects poses a significant challenge in the field of artificial +intelligence. Unfortunately, current state-of-the-art video generation methods, +primarily focusing on text-to-video generation, tend to produce video clips +with minimal motions despite maintaining high fidelity. We argue that relying +solely on text instructions is insufficient and suboptimal for video +generation. In this paper, we introduce PixelDance, a novel approach based on +diffusion models that incorporates image instructions for both the first and +last frames in conjunction with text instructions for video generation. +Comprehensive experimental results demonstrate that PixelDance trained with +public data exhibits significantly better proficiency in synthesizing videos +with complex scenes and intricate motions, setting a new standard for video +generation.",cs.CV,['cs.CV'] +MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,Yining Hong · Zishuo Zheng · Peihao Chen · Yian Wang · Junyan Li · Chuang Gan, ,https://arxiv.org/abs/2401.08577,,2401.08577.pdf,MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,"Human beings possess the capability to multiply a melange of multisensory +cues while actively exploring and interacting with the 3D world. Current +multi-modal large language models, however, passively absorb sensory data as +inputs, lacking the capacity to actively interact with the objects in the 3D +environment and dynamically collect their multisensory information. To usher in +the study of this area, we propose MultiPLY, a multisensory embodied large +language model that could incorporate multisensory interactive data, including +visual, audio, tactile, and thermal information into large language models, +thereby establishing the correlation among words, actions, and percepts. To +this end, we first collect Multisensory Universe, a large-scale multisensory +interaction dataset comprising 500k data by deploying an LLM-powered embodied +agent to engage with the 3D environment. To perform instruction tuning with +pre-trained LLM on such generated data, we first encode the 3D scene as +abstracted object-centric representations and then introduce action tokens +denoting that the embodied agent takes certain actions within the environment, +as well as state tokens that represent the multisensory state observations of +the agent at each time step. In the inference time, MultiPLY could generate +action tokens, instructing the agent to take the action in the environment and +obtain the next multisensory state observation. The observation is then +appended back to the LLM via state tokens to generate subsequent text or action +tokens. 
We demonstrate that MultiPLY outperforms baselines by a large margin +through a diverse set of embodied tasks involving object retrieval, tool use, +multisensory captioning, and task decomposition.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.RO']" +Referring Expression Counting,Siyang Dai · Jun Liu · Ngai-Man Cheung, ,https://arxiv.org/abs/2405.15658,,2405.15658.pdf,HDC: Hierarchical Semantic Decoding with Counting Assistance for Generalized Referring Expression Segmentation,"The newly proposed Generalized Referring Expression Segmentation (GRES) +amplifies the formulation of classic RES by involving multiple/non-target +scenarios. Recent approaches focus on optimizing the last modality-fused +feature which is directly utilized for segmentation and object-existence +identification. However, the attempt to integrate all-grained information into +a single joint representation is impractical in GRES due to the increased +complexity of the spatial relationships among instances and deceptive text +descriptions. Furthermore, the subsequent binary target justification across +all referent scenarios fails to specify their inherent differences, leading to +ambiguity in object understanding. To address the weakness, we propose a +$\textbf{H}$ierarchical Semantic $\textbf{D}$ecoding with $\textbf{C}$ounting +Assistance framework (HDC). It hierarchically transfers complementary modality +information across granularities, and then aggregates each well-aligned +semantic correspondence for multi-level decoding. Moreover, with complete +semantic context modeling, we endow HDC with explicit counting capability to +facilitate comprehensive object perception in multiple/single/non-target +settings. Experimental results on gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO +benchmarks demonstrate the effectiveness and rationality of HDC which +outperforms the state-of-the-art GRES methods by a remarkable margin. Code will +be available $\href{https://github.com/RobertLuo1/HDC}{here}$.",cs.CV,"['cs.CV', 'cs.AI']" +UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization,Shuaibo Li · Wei Ma · Jianwei Guo · Shibiao Xu · Benchong Li · Xiaopeng Zhang, ,,https://ieeexplore.ieee.org/abstract/document/10155416,,,,,nan +Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization,Takuhiro Kaneko, ,,https://adversarr.github.io/ps/Papers/2024/03/14/pac-nerf-physics-augmented-continuum-neural-radiance-fields-for-geometry-agnostic-system-identification/,,,,,nan +SonicVisionLM: Playing Sound with Vision Language Models,Zhifeng Xie · Shengye Yu · Qile He · Mengtian Li, ,https://arxiv.org/abs/2401.04394,,2401.04394.pdf,SonicVisionLM: Playing Sound with Vision Language Models,"There has been a growing interest in the task of generating sound for silent +videos, primarily because of its practicality in streamlining video +post-production. However, existing methods for video-sound generation attempt +to directly create sound from visual representations, which can be challenging +due to the difficulty of aligning visual representations with audio +representations. In this paper, we present SonicVisionLM, a novel framework +aimed at generating a wide range of sound effects by leveraging vision-language +models(VLMs). Instead of generating audio directly from video, we use the +capabilities of powerful VLMs. 
When provided with a silent video, our approach +first identifies events within the video using a VLM to suggest possible sounds +that match the video content. This shift in approach transforms the challenging +task of aligning image and audio into more well-studied sub-problems of +aligning image-to-text and text-to-audio through the popular diffusion models. +To improve the quality of audio recommendations with LLMs, we have collected an +extensive dataset that maps text descriptions to specific sound effects and +developed a time-controlled audio adapter. Our approach surpasses current +state-of-the-art methods for converting video to audio, enhancing +synchronization with the visuals, and improving alignment between audio and +video components. Project page: +https://yusiissy.github.io/SonicVisionLM.github.io/",cs.MM,"['cs.MM', 'cs.SD', 'eess.AS']" +A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation,Qucheng Peng · Ce Zheng · Chen Chen, ,https://arxiv.org/abs/2403.11310,,2403.11310.pdf,A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation,"3D human pose data collected in controlled laboratory settings present +challenges for pose estimators that generalize across diverse scenarios. To +address this, domain generalization is employed. Current methodologies in +domain generalization for 3D human pose estimation typically utilize +adversarial training to generate synthetic poses for training. Nonetheless, +these approaches exhibit several limitations. First, the lack of prior +information about the target domain complicates the application of suitable +augmentation through a single pose augmentor, affecting generalization on +target domains. Moreover, adversarial training's discriminator tends to enforce +similarity between source and synthesized poses, impeding the exploration of +out-of-source distributions. Furthermore, the pose estimator's optimization is +not exposed to domain shifts, limiting its overall generalization ability. + To address these limitations, we propose a novel framework featuring two pose +augmentors: the weak and the strong augmentors. Our framework employs +differential strategies for generation and discrimination processes, +facilitating the preservation of knowledge related to source poses and the +exploration of out-of-source distributions without prior information about +target poses. Besides, we leverage meta-optimization to simulate domain shifts +in the optimization process of the pose estimator, thereby improving its +generalization ability. Our proposed approach significantly outperforms +existing methods, as demonstrated through comprehensive experiments on various +benchmark datasets.Our code will be released at +\url{https://github.com/davidpengucf/DAF-DG}.",cs.CV,['cs.CV'] +ProMotion: Prototypes As Motion Learners,Yawen Lu · Dongfang Liu · Qifan Wang · Cheng Han · Yiming Cui · Yiming Cui · Zhiwen Cao · Xueling Zhang · Yingjie Victor Chen · Heng Fan, ,https://ar5iv.labs.arxiv.org/html/2304.11523,,2304.11523.pdf,TransFlow: Transformer as Flow Learner,"Optical flow is an indispensable building block for various important +computer vision tasks, including motion estimation, object tracking, and +disparity measurement. In this work, we propose TransFlow, a pure transformer +architecture for optical flow estimation. Compared to dominant CNN-based +methods, TransFlow demonstrates three advantages. 
First, it provides more +accurate correlation and trustworthy matching in flow estimation by utilizing +spatial self-attention and cross-attention mechanisms between adjacent frames +to effectively capture global dependencies; Second, it recovers more +compromised information (e.g., occlusion and motion blur) in flow estimation +through long-range temporal association in dynamic scenes; Third, it enables a +concise self-learning paradigm and effectively eliminate the complex and +laborious multi-stage pre-training procedures. We achieve the state-of-the-art +results on the Sintel, KITTI-15, as well as several downstream tasks, including +video object detection, interpolation and stabilization. For its efficacy, we +hope TransFlow could serve as a flexible baseline for optical flow estimation.",cs.CV,['cs.CV'] +Event-assisted Low-Light Video Object Segmentation,Li Hebei · Jin Wang · Jiahui Yuan · Yue Li · Wenming Weng · Yansong Peng · Yueyi Zhang · Zhiwei Xiong · Xiaoyan Sun, ,https://arxiv.org/abs/2404.01945,,2404.01945.pdf,Event-assisted Low-Light Video Object Segmentation,"In the realm of video object segmentation (VOS), the challenge of operating +under low-light conditions persists, resulting in notably degraded image +quality and compromised accuracy when comparing query and memory frames for +similarity computation. Event cameras, characterized by their high dynamic +range and ability to capture motion information of objects, offer promise in +enhancing object visibility and aiding VOS methods under such low-light +conditions. This paper introduces a pioneering framework tailored for low-light +VOS, leveraging event camera data to elevate segmentation accuracy. Our +approach hinges on two pivotal components: the Adaptive Cross-Modal Fusion +(ACMF) module, aimed at extracting pertinent features while fusing image and +event modalities to mitigate noise interference, and the Event-Guided Memory +Matching (EGMM) module, designed to rectify the issue of inaccurate matching +prevalent in low-light settings. Additionally, we present the creation of a +synthetic LLE-DAVIS dataset and the curation of a real-world LLE-VOS dataset, +encompassing frames and events. Experimental evaluations corroborate the +efficacy of our method across both datasets, affirming its effectiveness in +low-light scenarios.",cs.CV,['cs.CV'] +Towards Backward-Compatible Continual Learning of Image Compression,Zhihao Duan · Ming Lu · Justin Yang · Jiangpeng He · Zhan Ma · Fengqing Zhu, ,https://arxiv.org/abs/2402.18862,,2402.18862.pdf,Towards Backward-Compatible Continual Learning of Image Compression,"This paper explores the possibility of extending the capability of +pre-trained neural image compressors (e.g., adapting to new data or target +bitrates) without breaking backward compatibility, the ability to decode +bitstreams encoded by the original model. We refer to this problem as continual +learning of image compression. Our initial findings show that baseline +solutions, such as end-to-end fine-tuning, do not preserve the desired backward +compatibility. To tackle this, we propose a knowledge replay training strategy +that effectively addresses this issue. We also design a new model architecture +that enables more effective continual learning than existing baselines. +Experiments are conducted for two scenarios: data-incremental learning and +rate-incremental learning. 
The main conclusion of this paper is that neural +image compressors can be fine-tuned to achieve better performance (compared to +their pre-trained version) on new data and rates without compromising backward +compatibility. Our code is available at +https://gitlab.com/viper-purdue/continual-compression",eess.IV,['eess.IV'] +Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses,Inhee Lee · Byungjun Kim · Hanbyul Joo, ,http://export.arxiv.org/abs/2404.14410,,2404.14410.pdf,Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses,"In this paper, we present a method to reconstruct the world and multiple +dynamic humans in 3D from a monocular video input. As a key idea, we represent +both the world and multiple humans via the recently emerging 3D Gaussian +Splatting (3D-GS) representation, enabling to conveniently and efficiently +compose and render them together. In particular, we address the scenarios with +severely limited and sparse observations in 3D human reconstruction, a common +challenge encountered in the real world. To tackle this challenge, we introduce +a novel approach to optimize the 3D-GS representation in a canonical space by +fusing the sparse cues in the common space, where we leverage a pre-trained 2D +diffusion model to synthesize unseen views while keeping the consistency with +the observed 2D appearances. We demonstrate our method can reconstruct +high-quality animatable 3D humans in various challenging examples, in the +presence of occlusion, image crops, few-shot, and extremely sparse +observations. After reconstruction, our method is capable of not only rendering +the scene in any novel views at arbitrary time instances, but also editing the +3D scene by removing individual humans or applying different motions for each +human. Through various experiments, we demonstrate the quality and efficiency +of our methods over alternative existing approaches.",cs.CV,['cs.CV'] +EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling,Haiyang Liu · Zihao Zhu · Giorgio Becherini · YICHEN PENG · Mingyang Su · YOU ZHOU · Xuefei Zhe · Naoya Iwamoto · Bo Zheng · Michael J. Black,https://pantomatrix.github.io/EMAGE/,https://arxiv.org/abs/2401.00374,,2401.00374.pdf,EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling,"We propose EMAGE, a framework to generate full-body human gestures from audio +and masked gestures, encompassing facial, local body, hands, and global +movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new +mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with +FLAME head parameters and further refines the modeling of head, neck, and +finger movements, offering a community-standardized, high-quality 3D motion +captured dataset. EMAGE leverages masked body gesture priors during training to +boost inference performance. It involves a Masked Audio Gesture Transformer, +facilitating joint training on audio-to-gesture generation and masked gesture +reconstruction to effectively encode audio and body gesture hints. Encoded body +hints from masked gestures are then separately employed to generate facial and +body movements. Moreover, EMAGE adaptively merges speech features from the +audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance +the results' fidelity and diversity. 
Experiments demonstrate that EMAGE +generates holistic gestures with state-of-the-art performance and is flexible +in accepting predefined spatial-temporal gesture inputs, generating complete, +audio-synchronized results. Our code and dataset are available +https://pantomatrix.github.io/EMAGE/",cs.CV,['cs.CV'] +A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint,Xiaofeng Cong · Jie Gui · Jing Zhang · Junming Hou · Hao Shen,https://github.com/Xiaofeng-life/SFSNiD/,https://arxiv.org/abs/2403.18548,,2403.18548.pdf,A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint,"Existing research based on deep learning has extensively explored the problem +of daytime image dehazing. However, few studies have considered the +characteristics of nighttime hazy scenes. There are two distinctions between +nighttime and daytime haze. First, there may be multiple active colored light +sources with lower illumination intensity in nighttime scenes, which may cause +haze, glow and noise with localized, coupled and frequency inconsistent +characteristics. Second, due to the domain discrepancy between simulated and +real-world data, unrealistic brightness may occur when applying a dehazing +model trained on simulated data to real-world data. To address the above two +issues, we propose a semi-supervised model for real-world nighttime dehazing. +First, the spatial attention and frequency spectrum filtering are implemented +as a spatial-frequency domain information interaction module to handle the +first issue. Second, a pseudo-label-based retraining strategy and a local +window-based brightness loss for semi-supervised training process is designed +to suppress haze and glow while achieving realistic brightness. Experiments on +public benchmarks validate the effectiveness of the proposed method and its +superiority over state-of-the-art methods. The source code and Supplementary +Materials are placed in the https://github.com/Xiaofeng-life/SFSNiD.",cs.CV,['cs.CV'] +How to Configure Good In-Context Sequence for Visual Question Answering,Li Li · Jiawei Peng · huiyi chen · Chongyang Gao · Xu Yang, ,https://arxiv.org/abs/2312.01571,,2312.01571.pdf,How to Configure Good In-Context Sequence for Visual Question Answering,"Inspired by the success of Large Language Models in dealing with new tasks +via In-Context Learning (ICL) in NLP, researchers have also developed Large +Vision-Language Models (LVLMs) with ICL capabilities. However, when +implementing ICL using these LVLMs, researchers usually resort to the simplest +way like random sampling to configure the in-context sequence, thus leading to +sub-optimal results. To enhance the ICL performance, in this study, we use +Visual Question Answering (VQA) as case study to explore diverse in-context +configurations to find the powerful ones. Additionally, through observing the +changes of the LVLM outputs by altering the in-context sequence, we gain +insights into the inner properties of LVLMs, improving our understanding of +them. Specifically, to explore in-context configurations, we design diverse +retrieval methods and employ different strategies to manipulate the retrieved +demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, +VizWiz, and OK-VQA, we uncover three important inner properties of the applied +LVLM and demonstrate which strategies can consistently improve the ICL VQA +performance. 
Our code is provided in: +https://github.com/GaryJiajia/OFv2_ICL_VQA.",cs.CV,"['cs.CV', 'cs.AI']" +Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval,Minkuk Kim · Hyeon Bae Kim · Jinyoung Moon · Jinwoo Choi · Seong Tae Kim, ,https://arxiv.org/abs/2404.07610,,2404.07610.pdf,Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval,"There has been significant attention to the research on dense video +captioning, which aims to automatically localize and caption all events within +untrimmed video. Several studies introduce methods by designing dense video +captioning as a multitasking problem of event localization and event captioning +to consider inter-task relations. However, addressing both tasks using only +visual input is challenging due to the lack of semantic content. In this study, +we address this by proposing a novel framework inspired by the cognitive +information processing of humans. Our model utilizes external memory to +incorporate prior knowledge. The memory retrieval method is proposed with +cross-modal video-to-text matching. To effectively incorporate retrieved text +features, the versatile encoder and the decoder with visual and textual +cross-attention modules are designed. Comparative experiments have been +conducted to show the effectiveness of the proposed method on ActivityNet +Captions and YouCook2 datasets. Experimental results show promising performance +of our model without extensive pretraining from a large video dataset.",cs.CV,['cs.CV'] +Towards Text-guided 3D Scene Composition,Qihang Zhang · Chaoyang Wang · Aliaksandr Siarohin · Peiye Zhuang · Yinghao Xu · Ceyuan Yang · Dahua Lin · Bolei Zhou · Sergey Tulyakov · Hsin-Ying Lee, ,https://arxiv.org/abs/2312.08885,,2312.08885.pdf,SceneWiz3D: Towards Text-guided 3D Scene Composition,"We are witnessing significant breakthroughs in the technology for generating +3D objects from text. Existing approaches either leverage large text-to-image +models to optimize a 3D representation or train 3D generators on object-centric +datasets. Generating entire scenes, however, remains very challenging as a +scene contains multiple 3D objects, diverse and scattered. In this work, we +introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes +from text. We marry the locality of objects with globality of scenes by +introducing a hybrid 3D representation: explicit for objects and implicit for +scenes. Remarkably, an object, being represented explicitly, can be either +generated from text using conventional text-to-3D approaches, or provided by +users. To configure the layout of the scene and automatically place objects, we +apply the Particle Swarm Optimization technique during the optimization +process. Furthermore, it is difficult for certain parts of the scene (e.g., +corners, occlusion) to receive multi-view supervision, leading to inferior +geometry. We incorporate an RGBD panorama diffusion model to mitigate it, +resulting in high-quality geometry. 
Extensive evaluation supports that our +approach achieves superior quality over previous approaches, enabling the +generation of detailed and view-consistent 3D scenes.",cs.CV,['cs.CV'] +Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,Kanchana Ranasinghe · Satya Narayan Shukla · Omid Poursaeed · Michael Ryoo · Tsung-Yu Lin, ,https://arxiv.org/abs/2404.07449,,2404.07449.pdf,Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,"Integration of Large Language Models (LLMs) into visual domain tasks, +resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in +vision-language tasks, particularly for visual question answering (VQA). +However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial +reasoning and localization awareness. Despite generating highly descriptive and +elaborate textual answers, these models fail at simple tasks like +distinguishing a left vs right location. In this work, we explore how +image-space coordinate based instruction fine-tuning objectives could inject +spatial awareness into V-LLMs. We discover optimal coordinate representations, +data-efficient instruction fine-tuning objectives, and pseudo-data generation +strategies that lead to improved spatial awareness in V-LLMs. Additionally, our +resulting model improves VQA across image and video domains, reduces undesired +hallucination, and generates better contextual object descriptions. Experiments +across 5 vision-language tasks involving 14 different datasets establish the +clear performance improvements achieved by our proposed framework.",cs.CV,['cs.CV'] +Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning,Leonardo Iurada · Marco Ciccone · Tatiana Tommasi,https://iurada.github.io/PX,https://arxiv.org/abs/2405.00906,,2405.00906.pdf,LOTUS: Improving Transformer Efficiency with Sparsity Pruning and Data Lottery Tickets,"Vision transformers have revolutionized computer vision, but their +computational demands present challenges for training and deployment. This +paper introduces LOTUS (LOttery Transformers with Ultra Sparsity), a novel +method that leverages data lottery ticket selection and sparsity pruning to +accelerate vision transformer training while maintaining accuracy. Our approach +focuses on identifying and utilizing the most informative data subsets and +eliminating redundant model parameters to optimize the training process. +Through extensive experiments, we demonstrate the effectiveness of LOTUS in +achieving rapid convergence and high accuracy with significantly reduced +computational requirements. This work highlights the potential of combining +data selection and sparsity techniques for efficient vision transformer +training, opening doors for further research and development in this area.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Fully Geometric Panoramic Localization,Junho Kim · Jiwon Jeong · Young Min Kim,https://82magnolia.github.io/fgpl/,https://arxiv.org/abs/2403.19904,,2403.19904.pdf,Fully Geometric Panoramic Localization,"We introduce a lightweight and accurate localization method that only +utilizes the geometry of 2D-3D lines. Given a pre-captured 3D map, our approach +localizes a panorama image, taking advantage of the holistic 360 view. The +system mitigates potential privacy breaches or domain discrepancies by avoiding +trained or hand-crafted visual descriptors. 
However, as lines alone can be +ambiguous, we express distinctive yet compact spatial contexts from +relationships between lines, namely the dominant directions of parallel lines +and the intersection between non-parallel lines. The resulting representations +are efficient in processing time and memory compared to conventional visual +descriptor-based methods. Given the groups of dominant line directions and +their intersections, we accelerate the search process to test thousands of pose +candidates in less than a millisecond without sacrificing accuracy. We +empirically show that the proposed 2D-3D matching can localize panoramas for +challenging scenes with similar structures, dramatic domain shifts or +illumination changes. Our fully geometric approach does not involve extensive +parameter tuning or neural network training, making it a practical algorithm +that can be readily deployed in the real world. Project page including the code +is available through this link: https://82magnolia.github.io/fgpl/.",cs.CV,['cs.CV'] +VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift,Leyuan Liu · Yuhan Li · Yunqi Gao · Changxin Gao · Yuanyuan Liu · Jingying Chen,https://github.com/naivate/VS.git,https://arxiv.org/abs/2309.13524,,2309.13524.pdf,Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction,"Reconstructing 3D clothed human avatars from single images is a challenging +task, especially when encountering complex poses and loose clothing. Current +methods exhibit limitations in performance, largely attributable to their +dependence on insufficient 2D image features and inconsistent query methods. +Owing to this, we present the Global-correlated 3D-decoupling Transformer for +clothed Avatar reconstruction (GTA), a novel transformer-based architecture +that reconstructs clothed human avatars from monocular images. Our approach +leverages transformer architectures by utilizing a Vision Transformer model as +an encoder for capturing global-correlated image features. Subsequently, our +innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane +features, using learnable embeddings as queries for cross-plane generation. To +effectively enhance feature fusion with the tri-plane 3D feature and human body +prior, we propose a hybrid prior fusion strategy combining spatial and +prior-enhanced queries, leveraging the benefits of spatial localization and +human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 +datasets illustrate that our method outperforms state-of-the-art approaches in +both geometry and texture reconstruction, exhibiting high robustness to +challenging poses and loose clothing, and producing higher-resolution textures. +Codes will be available at https://github.com/River-Zhang/GTA.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,Yong Shu · Liquan Shen · Xiangyu Hu · Mengyao Li · Zihao Zhou,https://github.com/yungsyu99/Real-HDRV,https://arxiv.org/abs/2405.00244,,2405.00244.pdf,Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,"As an important and practical way to obtain high dynamic range (HDR) video, +HDR video reconstruction from sequences with alternating exposures is still +less explored, mainly due to the lack of large-scale real-world datasets. +Existing methods are mostly trained on synthetic datasets, which perform poorly +in real scenes. 
In this work, to facilitate the development of real-world HDR +video reconstruction, we present Real-HDRV, a large-scale real-world benchmark +dataset for HDR video reconstruction, featuring various scenes, diverse motion +patterns, and high-quality labels. Specifically, our dataset contains 500 +LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, +covering daytime, nighttime, indoor, and outdoor scenes. To our best knowledge, +our dataset is the largest real-world HDR video reconstruction dataset. +Correspondingly, we propose an end-to-end network for HDR video reconstruction, +where a novel two-stage strategy is designed to perform alignment sequentially. +Specifically, the first stage performs global alignment with the adaptively +estimated global offsets, reducing the difficulty of subsequent alignment. The +second stage implicitly performs local alignment in a coarse-to-fine manner at +the feature level using the adaptive separable convolution. Extensive +experiments demonstrate that: (1) models trained on our dataset can achieve +better performance on real scenes than those trained on synthetic datasets; (2) +our method outperforms previous state-of-the-art methods. Our dataset is +available at https://github.com/yungsyu99/Real-HDRV.",cs.CV,['cs.CV'] +Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding,Jin-Chuan Shi · Miao Wang · Haobin Duan · Shaohua Guan,https://buaavrcg.github.io/LEGaussians/,https://arxiv.org/abs/2311.18482,,2311.18482.pdf,Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding,"Open-vocabulary querying in 3D space is challenging but essential for scene +understanding tasks such as object localization and segmentation. +Language-embedded scene representations have made progress by incorporating +language features into 3D spaces. However, their efficacy heavily depends on +neural networks that are resource-intensive in training and rendering. Although +recent 3D Gaussians offer efficient and high-quality novel view synthesis, +directly embedding language features in them leads to prohibitive memory usage +and decreased performance. In this work, we introduce Language Embedded 3D +Gaussians, a novel scene representation for open-vocabulary query tasks. +Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we +propose a dedicated quantization scheme that drastically alleviates the memory +requirement, and a novel embedding procedure that achieves smoother yet high +accuracy query, countering the multi-view feature inconsistencies and the +high-frequency inductive bias in point-based representations. Our comprehensive +experiments show that our representation achieves the best visual quality and +language querying accuracy across current language-embedded representations, +while maintaining real-time rendering frame rates on a single desktop GPU.",cs.CV,"['cs.CV', 'cs.GR']" +GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians,Shenhan Qian · Tobias Kirschstein · Liam Schoneveld · Davide Davoli · Simon Giebenhain · Matthias Nießner,https://shenhanqian.github.io/gaussian-avatars,https://arxiv.org/abs/2312.02069,,2312.02069.pdf,GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians,"We introduce GaussianAvatars, a new method to create photorealistic head +avatars that are fully controllable in terms of expression, pose, and +viewpoint. 
The core idea is a dynamic 3D representation based on 3D Gaussian +splats that are rigged to a parametric morphable face model. This combination +facilitates photorealistic rendering while allowing for precise animation +control via the underlying parametric model, e.g., through expression transfer +from a driving sequence or by manually changing the morphable model parameters. +We parameterize each splat by a local coordinate frame of a triangle and +optimize for explicit displacement offset to obtain a more accurate geometric +representation. During avatar reconstruction, we jointly optimize for the +morphable model parameters and Gaussian splat parameters in an end-to-end +fashion. We demonstrate the animation capabilities of our photorealistic avatar +in several challenging scenarios. For instance, we show reenactments from a +driving video, where our method outperforms existing works by a significant +margin.",cs.CV,['cs.CV'] +Garment Recovery with Shape and Deformation Priors,Ren Li · Corentin Dumery · Benoît Guillard · Pascal Fua, ,https://arxiv.org/abs/2311.10356,,2311.10356.pdf,Garment Recovery with Shape and Deformation Priors,"While modeling people wearing tight-fitting clothing has made great strides +in recent years, loose-fitting clothing remains a challenge. We propose a +method that delivers realistic garment models from real-world images, +regardless of garment shape or deformation. To this end, we introduce a fitting +approach that utilizes shape and deformation priors learned from synthetic data +to accurately capture garment shapes and deformations, including large ones. +Not only does our approach recover the garment geometry accurately, it also +yields models that can be directly used by downstream applications such as +animation and simulation.",cs.CV,['cs.CV'] +Neighbor Relations Matter in Video Scene Detection,Jiawei Tan · Hongxing Wang · Jiaxin Li · Zhilong Ou · Zhangbin Qian, ,,https://www.semanticscholar.org/paper/Characters-Link-Shots:-Character-Attention-Network-Tan-Wang/031a0952b156f36ea9da7113ade868754100e4b7,,,,,nan +The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective,Wenqi Jia · Miao Liu · Hao Jiang · Ishwarya Ananthabhotla · James Rehg · Vamsi Krishna Ithapu · Ruohan Gao, ,https://arxiv.org/abs/2312.12870,,2312.12870.pdf,The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective,"In recent years, the thriving development of research related to egocentric +videos has provided a unique perspective for the study of conversational +interactions, where both visual and audio signals play a crucial role. While +most prior work focus on learning about behaviors that directly involve the +camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction +problem, marking the first attempt to infer exocentric conversational +interactions from egocentric videos. We propose a unified multi-modal framework +-- Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of +conversation behaviors -- speaking and listening -- for both the camera wearer +as well as all other social partners present in the egocentric video. +Specifically, we adopt the self-attention mechanism to model the +representations across-time, across-subjects, and across-modalities. To +validate our method, we conduct experiments on a challenging egocentric video +dataset that includes multi-speaker and multi-conversation scenarios. 
Our +results demonstrate the superior performance of our method compared to a series +of baselines. We also present detailed ablation studies to assess the +contribution of each component in our model. Check our project page at +https://vjwq.github.io/AV-CONV/.",cs.CV,['cs.CV'] +Dense Vision Transformer Compression with Few Samples,Hanxiao Zhang · Yifan Zhou · Guo-Hua Wang, ,https://arxiv.org/abs/2403.18708,,2403.18708.pdf,Dense Vision Transformer Compression with Few Samples,"Few-shot model compression aims to compress a large model into a more compact +one with only a tiny training set (even without labels). Block-level pruning +has recently emerged as a leading technique in achieving high accuracy and low +latency in few-shot CNN compression. But, few-shot compression for Vision +Transformers (ViT) remains largely unexplored, which presents a new challenge. +In particular, the issue of sparse compression exists in traditional CNN +few-shot methods, which can only produce very few compressed models of +different model sizes. This paper proposes a novel framework for few-shot ViT +compression named DC-ViT. Instead of dropping the entire block, DC-ViT +selectively eliminates the attention module while retaining and reusing +portions of the MLP module. DC-ViT enables dense compression, which outputs +numerous compressed models that densely populate the range of model complexity. +DC-ViT outperforms state-of-the-art few-shot compression methods by a +significant margin of 10 percentage points, along with lower latency in the +compression of ViT and its variants.",cs.CV,['cs.CV'] +Structure-from-Motion from Pixel-wise Correspondences,Philipp Lindenberger · Paul-Edouard Sarlin · Marc Pollefeys, ,http://export.arxiv.org/abs/2306.13643,,2306.13643.pdf,LightGlue: Local Feature Matching at Light Speed,"We introduce LightGlue, a deep neural network that learns to match local +features across images. We revisit multiple design decisions of SuperGlue, the +state of the art in sparse matching, and derive simple but effective +improvements. Cumulatively, they make LightGlue more efficient - in terms of +both memory and computation, more accurate, and much easier to train. One key +property is that LightGlue is adaptive to the difficulty of the problem: the +inference is much faster on image pairs that are intuitively easy to match, for +example because of a larger visual overlap or limited appearance change. This +opens up exciting prospects for deploying deep matchers in latency-sensitive +applications like 3D reconstruction. The code and trained models are publicly +available at https://github.com/cvg/LightGlue.",cs.CV,['cs.CV'] +KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation,Fengyuan Yang · Kerui Gu · Angela Yao,https://github.com/MartaYang/KITRO,https://arxiv.org/abs/2405.19833,,2405.19833.pdf,KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation,"2D keypoints are commonly used as an additional cue to refine estimated 3D +human meshes. Current methods optimize the pose and shape parameters with a +reprojection loss on the provided 2D keypoints. Such an approach, while simple +and intuitive, has limited effectiveness because the optimal solution is hard +to find in ambiguous parameter space and may sacrifice depth. Additionally, +divergent gradients from distal joints complicate and deviate the refinement of +proximal joints in the kinematic chain. 
To address these, we introduce +Kinematic-Tree Rotation (KITRO), a novel mesh refinement strategy that +explicitly models depth and human kinematic-tree structure. KITRO treats +refinement from a bone-wise perspective. Unlike previous methods which perform +gradient-based optimizations, our method calculates bone directions in closed +form. By accounting for the 2D pose, bone length, and parent joint's depth, the +calculation results in two possible directions for each child joint. We then +use a decision tree to trace binary choices for all bones along the human +skeleton's kinematic-tree to select the most probable hypothesis. Our +experiments across various datasets and baseline models demonstrate that KITRO +significantly improves 3D joint estimation accuracy and achieves an ideal 2D +fit simultaneously. Our code available at: https://github.com/MartaYang/KITRO.",cs.CV,['cs.CV'] +Orthogonal Adaptation for Modular Customization of Diffusion Models,Ryan Po · Guandao Yang · Kfir Aberman · Gordon Wetzstein, ,https://arxiv.org/abs/2312.02432,,2312.02432.pdf,Orthogonal Adaptation for Modular Customization of Diffusion Models,"Customization techniques for text-to-image models have paved the way for a +wide range of previously unattainable applications, enabling the generation of +specific concepts across diverse contexts and styles. While existing methods +facilitate high-fidelity customization for individual concepts or a limited, +pre-defined set of them, they fall short of achieving scalability, where a +single model can seamlessly render countless concepts. In this paper, we +address a new problem called Modular Customization, with the goal of +efficiently merging customized models that were fine-tuned independently for +individual concepts. This allows the merged model to jointly synthesize +concepts in one image without compromising fidelity or incurring any additional +computational costs. + To address this problem, we introduce Orthogonal Adaptation, a method +designed to encourage the customized models, which do not have access to each +other during fine-tuning, to have orthogonal residual weights. This ensures +that during inference time, the customized models can be summed with minimal +interference. + Our proposed method is both simple and versatile, applicable to nearly all +optimizable weights in the model architecture. Through an extensive set of +quantitative and qualitative evaluations, our method consistently outperforms +relevant baselines in terms of efficiency and identity preservation, +demonstrating a significant leap toward scalable customization of diffusion +models.",cs.CV,['cs.CV'] +Open-World Human-Object Interaction Detection via Multi-modal Prompts,Jie Yang · Bingliang Li · Ailing Zeng · Ailing Zeng · Lei Zhang · Ruimao Zhang, ,,https://openreview.net/forum?id=qrv4wcmmxe,,,,,nan +Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching,Peng Xu · Zhiyu Xiang · Chengyu Qiao · Jingyun Fu · Tianyu Pu, ,https://arxiv.org/abs/2306.15612,,2306.15612.pdf,Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching,"Despite the great success of deep learning in stereo matching, recovering +accurate disparity maps is still challenging. Currently, L1 and cross-entropy +are the two most widely used losses for stereo network training. Compared with +the former, the latter usually performs better thanks to its probability +modeling and direct supervision to the cost volume. 
However, how to accurately +model the stereo ground-truth for cross-entropy loss remains largely +under-explored. Existing works simply assume that the ground-truth +distributions are uni-modal, which ignores the fact that most of the edge +pixels can be multi-modal. In this paper, a novel adaptive multi-modal +cross-entropy loss (ADL) is proposed to guide the networks to learn different +distribution patterns for each pixel. Moreover, we optimize the disparity +estimator to further alleviate the bleeding or misalignment artifacts in +inference. Extensive experimental results show that our method is generic and +can help classic stereo networks regain state-of-the-art performance. In +particular, GANet with our method ranks $1^{st}$ on both the KITTI 2015 and +2012 benchmarks among the published methods. Meanwhile, excellent +synthetic-to-realistic generalization performance can be achieved by simply +replacing the traditional loss with ours.",cs.CV,['cs.CV'] +EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,Yaofang Liu · Xiaodong Cun · Xuebo Liu · Xintao Wang · Yong Zhang · Haoxin Chen · Yang Liu · Tieyong Zeng · Raymond Chan · Ying Shan, ,https://arxiv.org/abs/2310.11440,,2310.11440.pdf,EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,"The vision and language generative models have been overgrown in recent +years. For video generation, various open-sourced models and public-available +services have been developed to generate high-quality videos. However, these +methods often use a few metrics, e.g., FVD or IS, to evaluate the performance. +We argue that it is hard to judge the large conditional generative models from +the simple metrics since these models are often trained on very large datasets +with multi-aspect abilities. Thus, we propose a novel framework and pipeline +for exhaustively evaluating the performance of the generated videos. Our +approach involves generating a diverse and comprehensive list of 700 prompts +for text-to-video generation, which is based on an analysis of real-world user +data and generated with the assistance of a large language model. Then, we +evaluate the state-of-the-art video generative models on our carefully designed +benchmark, in terms of visual qualities, content qualities, motion qualities, +and text-video alignment with 17 well-selected objective metrics. To obtain the +final leaderboard of the models, we further fit a series of coefficients to +align the objective metrics to the users' opinions. Based on the proposed human +alignment method, our final score shows a higher correlation than simply +averaging the metrics, showing the effectiveness of the proposed evaluation +method.",cs.CV,['cs.CV'] +HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud,WENCAN CHENG · WENCAN CHENG · Hao Tang · Luc Van Gool · Jong Hwan Ko, ,https://arxiv.org/abs/2404.03159,,2404.03159.pdf,HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud,"Extracting keypoint locations from input hand frames, known as 3D hand pose +estimation, is a critical task in various human-computer interaction +applications. Essentially, the 3D hand pose estimation can be regarded as a 3D +point subset generative problem conditioned on input frames. Thanks to the +recent significant progress on diffusion-based generative models, hand pose +estimation can also benefit from the diffusion model to estimate keypoint +locations with high quality. 
However, directly deploying the existing diffusion +models to solve hand pose estimation is non-trivial, since they cannot achieve +the complex permutation mapping and precise localization. Based on this +motivation, this paper proposes HandDiff, a diffusion-based hand pose +estimation model that iteratively denoises accurate hand pose conditioned on +hand-shaped image-point clouds. In order to recover keypoint permutation and +accurate location, we further introduce joint-wise condition and local detail +condition. Experimental results demonstrate that the proposed HandDiff +significantly outperforms the existing approaches on four challenging hand pose +benchmark datasets. Codes and pre-trained models are publicly available at +https://github.com/cwc1260/HandDiff.",cs.CV,['cs.CV'] +Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning,Sicong Shen · Yang Zhou · Bingzheng Wei · Eric Chang · Yan Xu, ,https://arxiv.org/abs/2312.03732,,2312.03732.pdf,A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA,"As large language models (LLMs) have become increasingly compute and memory +intensive, parameter-efficient fine-tuning (PEFT) methods are now a common +strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), +which adds trainable low-rank ""adapters"" to selected layers. Each adapter +consists of a low-rank matrix product, multiplicatively scaled by a +rank-dependent factor. This scaling factor, which divides adapters by a factor +of the rank, results in slowed learning and stunted performance for LoRA with +higher-rank adapters. Consequently, the use of LoRA in practice has generally +been limited to very low ranks. In this work, we study the impact of the +scaling factor on the learning process and prove that LoRA adapters should be +divided by a factor of the square root of the rank. Modifying LoRA with the +appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) +method, easily provides for a fine-tuning compute/performance trade-off, where +larger ranks can be used to trade off increased computational resources during +training for better fine-tuning performance, with no change in inference +computing cost.",cs.CL,"['cs.CL', 'cs.LG', 'I.2.7']" +En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data,Yifang Men · Biwen Lei · Yuan Yao · Miaomiao Cui · Zhouhui Lian · Xuansong Xie,https://menyifang.github.io/projects/En3D/index.html,https://arxiv.org/abs/2401.01173,,2401.01173.pdf,En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data,"We present En3D, an enhanced generative scheme for sculpting high-quality 3D +human avatars. Unlike previous works that rely on scarce 3D datasets or limited +2D collections with imbalanced viewing angles and imprecise pose priors, our +approach aims to develop a zero-shot 3D generative scheme capable of producing +visually realistic, geometrically accurate and content-wise diverse 3D humans +without relying on pre-existing 3D or 2D assets. To address this challenge, we +introduce a meticulously crafted workflow that implements accurate physical +modeling to learn the enhanced 3D generative model from synthetic 2D data. +During inference, we integrate optimization modules to bridge the gap between +realistic appearances and coarse 3D shapes. 
Specifically, En3D comprises three +modules: a 3D generator that accurately models generalizable 3D humans with +realistic appearance from synthesized balanced, diverse, and structured human +images; a geometry sculptor that enhances shape quality using multi-view normal +constraints for intricate human anatomy; and a texturing module that +disentangles explicit texture maps with fidelity and editability, leveraging +semantical UV partitioning and a differentiable rasterizer. Experimental +results show that our approach significantly outperforms prior works in terms +of image quality, geometry accuracy and content diversity. We also showcase the +applicability of our generated avatars for animation and editing, as well as +the scalability of our approach for content-style free adaptation.",cs.CV,['cs.CV'] +Differentiable Point-based Inverse Rendering,Hoon-Gyu Chung · Seokjun Choi · Seung-Hwan Baek,https://hg-chung.github.io/DPIR/,https://arxiv.org/abs/2312.02480,,2312.02480.pdf,Differentiable Point-based Inverse Rendering,"We present differentiable point-based inverse rendering, DPIR, an +analysis-by-synthesis method that processes images captured under diverse +illuminations to estimate shape and spatially-varying BRDF. To this end, we +adopt point-based rendering, eliminating the need for multiple samplings per +ray, typical of volumetric rendering, thus significantly enhancing the speed of +inverse rendering. To realize this idea, we devise a hybrid point-volumetric +representation for geometry and a regularized basis-BRDF representation for +reflectance. The hybrid geometric representation enables fast rendering through +point-based splatting while retaining the geometric details and stability +inherent to SDF-based representations. The regularized basis-BRDF mitigates the +ill-posedness of inverse rendering stemming from limited light-view angular +samples. We also propose an efficient shadow detection method using point-based +shadow map rendering. Our extensive evaluations demonstrate that DPIR +outperforms prior works in terms of reconstruction accuracy, computational +efficiency, and memory footprint. Furthermore, our explicit point-based +representation and rendering enables intuitive geometry and reflectance +editing.",cs.CV,['cs.CV'] +ICP-Flow: LiDAR Scene Flow Estimation with ICP,Yancong Lin · Holger Caesar,https://github.com/yanconglin/ICP-Flow,https://arxiv.org/abs/2402.17351,,2402.17351.pdf,ICP-Flow: LiDAR Scene Flow Estimation with ICP,"Scene flow characterizes the 3D motion between two LiDAR scans captured by an +autonomous vehicle at nearby timesteps. Prevalent methods consider scene flow +as point-wise unconstrained flow vectors that can be learned by either +large-scale training beforehand or time-consuming optimization at inference. +However, these methods do not take into account that objects in autonomous +driving often move rigidly. We incorporate this rigid-motion assumption into +our design, where the goal is to associate objects over scans and then estimate +the locally rigid transformations. We propose ICP-Flow, a learning-free flow +estimator. The core of our design is the conventional Iterative Closest Point +(ICP) algorithm, which aligns the objects over time and outputs the +corresponding rigid transformations. Crucially, to aid ICP, we propose a +histogram-based initialization that discovers the most likely translation, thus +providing a good starting point for ICP. The complete scene flow is then +recovered from the rigid transformations. 
We outperform state-of-the-art +baselines, including supervised models, on the Waymo dataset and perform +competitively on Argoverse-v2 and nuScenes. Further, we train a feedforward +neural network, supervised by the pseudo labels from our model, and achieve top +performance among all models capable of real-time inference. We validate the +advantage of our model on scene flow estimation with longer temporal gaps, up +to 0.4 seconds where other models fail to deliver meaningful results.",cs.CV,['cs.CV'] +Rolling Shutter Correction with Intermediate Distortion Flow Estimation,Mingdeng Cao · Sidi Yang · Yujiu Yang · Yinqiang Zheng,https://github.com/ljzycmd/DFRSC,https://arxiv.org/abs/2404.06350,,2404.06350.pdf,Rolling Shutter Correction with Intermediate Distortion Flow Estimation,"This paper proposes to correct the rolling shutter (RS) distorted images by +estimating the distortion flow from the global shutter (GS) to RS directly. +Existing methods usually perform correction using the undistortion flow from +the RS to GS. They initially predict the flow from consecutive RS frames, +subsequently rescaling it as the displacement fields from the RS frame to the +underlying GS image using time-dependent scaling factors. Following this, +RS-aware forward warping is employed to convert the RS image into its GS +counterpart. Nevertheless, this strategy is prone to two shortcomings. First, +the undistortion flow estimation is rendered inaccurate by merely linear +scaling the flow, due to the complex non-linear motion nature. Second, RS-aware +forward warping often results in unavoidable artifacts. To address these +limitations, we introduce a new framework that directly estimates the +distortion flow and rectifies the RS image with the backward warping operation. +More specifically, we first propose a global correlation-based flow attention +mechanism to estimate the initial distortion flow and GS feature jointly, which +are then refined by the following coarse-to-fine decoder layers. Additionally, +a multi-distortion flow prediction strategy is integrated to mitigate the issue +of inaccurate flow estimation further. Experimental results validate the +effectiveness of the proposed method, which outperforms state-of-the-art +approaches on various benchmarks while maintaining high efficiency. The project +is available at \url{https://github.com/ljzycmd/DFRSC}.",cs.CV,['cs.CV'] +Programmable Motion Generation for Open-set Motion Control Tasks,Hanchao Liu · Xiaohang Zhan · Shaoli Huang · Tai-Jiang Mu · Ying Shan, ,https://arxiv.org/abs/2405.19283,,2405.19283.pdf,Programmable Motion Generation for Open-Set Motion Control Tasks,"Character animation in real-world scenarios necessitates a variety of +constraints, such as trajectories, key-frames, interactions, etc. Existing +methodologies typically treat single or a finite set of these constraint(s) as +separate control tasks. They are often specialized, and the tasks they address +are rarely extendable or customizable. We categorize these as solutions to the +close-set motion control problem. In response to the complexity of practical +motion control, we propose and attempt to solve the open-set motion control +problem. This problem is characterized by an open and fully customizable set of +motion control tasks. To address this, we introduce a new paradigm, +programmable motion generation. In this paradigm, any given motion control task +is broken down into a combination of atomic constraints. 
These constraints are +then programmed into an error function that quantifies the degree to which a +motion sequence adheres to them. We utilize a pre-trained motion generation +model and optimize its latent code to minimize the error function of the +generated motion. Consequently, the generated motion not only inherits the +prior of the generative model but also satisfies the required constraints. +Experiments show that we can generate high-quality motions when addressing a +wide range of unseen tasks. These tasks encompass motion control by motion +dynamics, geometric constraints, physical laws, interactions with scenes, +objects or the character own body parts, etc. All of these are achieved in a +unified approach, without the need for ad-hoc paired training data collection +or specialized network designs. During the programming of novel tasks, we +observed the emergence of new skills beyond those of the prior model. With the +assistance of large language models, we also achieved automatic programming. We +hope that this work will pave the way for the motion control of general AI +agents.",cs.CV,['cs.CV'] +Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation,Bingxin Ke · Anton Obukhov · Shengyu Huang · Nando Metzger · Rodrigo Caye Daudt · Konrad Schindler, ,https://arxiv.org/abs/2312.02145,,2312.02145.pdf,Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation,"Monocular depth estimation is a fundamental computer vision task. Recovering +3D depth from a single image is geometrically ill-posed and requires scene +understanding, so it is not surprising that the rise of deep learning has led +to a breakthrough. The impressive progress of monocular depth estimators has +mirrored the growth in model capacity, from relatively modest CNNs to large +Transformer architectures. Still, monocular depth estimators tend to struggle +when presented with images with unfamiliar content and layout, since their +knowledge of the visual world is restricted by the data seen during training, +and challenged by zero-shot generalization to new domains. This motivates us to +explore whether the extensive priors captured in recent generative diffusion +models can enable better, more generalizable depth estimation. We introduce +Marigold, a method for affine-invariant monocular depth estimation that is +derived from Stable Diffusion and retains its rich prior knowledge. The +estimator can be fine-tuned in a couple of days on a single GPU using only +synthetic training data. It delivers state-of-the-art performance across a wide +range of datasets, including over 20% performance gains in specific cases. +Project page: https://marigoldmonodepth.github.io.",cs.CV,['cs.CV'] +I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions,Chengfeng Zhao · Juze Zhang · Jiashen Du · Ziwei Shan · Junye Wang · Jingyi Yu · Jingya Wang · Lan Xu, ,https://arxiv.org/abs/2312.08869,,2312.08869.pdf,I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions,"We are living in a world surrounded by diverse and ""smart"" devices with rich +modalities of sensing ability. Conveniently capturing the interactions between +us humans and these objects remains far-reaching. In this paper, we present +I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the +human and object in a novel setting: using a minimal amount of RGB camera and +object-mounted Inertial Measurement Unit (IMU). It combines general motion +inference and category-aware refinement. 
For the former, we introduce a +holistic human-object tracking method to fuse the IMU signals and the RGB +stream and progressively recover the human motions and subsequently the +companion object motions. For the latter, we tailor a category-aware motion +diffusion model, which is conditioned on both the raw IMU observations and the +results from the previous stage under over-parameterization representation. It +significantly refines the initial results and generates vivid body, hand, and +object motions. Moreover, we contribute a large dataset with ground truth human +and object motions, dense RGB inputs, and rich object-mounted IMU measurements. +Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid +capture setting. Our dataset and code will be released to the community.",cs.CV,['cs.CV'] +From a Bird’s Eye View to See: Joint Camera and Subject Registration without the Camera Calibration,Zekun Qian · Ruize Han · Wei Feng · Song Wang,https://github.com/zekunqian/bevsee,,https://allainews.com/item/from-a-birds-eye-view-to-see-joint-camera-and-subject-registration-without-the-camera-calibration-2024-04-30/,,,,,nan +LMDrive: Closed-Loop End-to-End Driving with Large Language Models,Hao Shao · Yuxuan Hu · Letian Wang · Guanglu Song · Steven L. Waslander · Yu Liu · Hongsheng Li, ,https://arxiv.org/abs/2312.07488,,2312.07488.pdf,LMDrive: Closed-Loop End-to-End Driving with Large Language Models,"Despite significant recent progress in the field of autonomous driving, +modern methods still struggle and can incur serious accidents when encountering +long-tail unforeseen events and challenging urban scenarios. On the one hand, +large language models (LLM) have shown impressive reasoning capabilities that +approach ""Artificial General Intelligence"". On the other hand, previous +autonomous driving methods tend to rely on limited-format inputs (e.g. sensor +data and navigation waypoints), restricting the vehicle's ability to understand +language information and interact with humans. To this end, this paper +introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous +driving framework. LMDrive uniquely processes and integrates multi-modal sensor +data with natural language instructions, enabling interaction with humans and +navigation software in realistic instructional settings. To facilitate further +research in language-based closed-loop autonomous driving, we also publicly +release the corresponding dataset which includes approximately 64K +instruction-following data clips, and the LangAuto benchmark that tests the +system's ability to handle complex instructions and challenging driving +scenarios. Extensive closed-loop experiments are conducted to demonstrate +LMDrive's effectiveness. To the best of our knowledge, we're the very first +work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, +models, and datasets can be found at https://github.com/opendilab/LMDrive",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +TUMTraf V2X Cooperative Perception Dataset,Walter Zimmer · Gerhard Arya Wardana · Suren Sritharan · Xingcheng Zhou · Rui Song · Alois Knoll,https://tum-traffic-dataset.github.io/tumtraf-v2x,https://arxiv.org/abs/2403.01316,,2403.01316.pdf,TUMTraf V2X Cooperative Perception Dataset,"Cooperative perception offers several benefits for enhancing the capabilities +of autonomous vehicles and improving road safety. Using roadside sensors in +addition to onboard sensors increases reliability and extends the sensor range. 
+External sensors offer higher situational awareness for automated vehicles and +prevent occlusions. We propose CoopDet3D, a cooperative multi-modal fusion +model, and TUMTraf-V2X, a perception dataset, for the cooperative 3D object +detection and tracking task. Our dataset contains 2,000 labeled point clouds +and 5,000 labeled images from five roadside and four onboard sensors. It +includes 30k 3D boxes with track IDs and precise GPS and IMU data. We labeled +eight categories and covered occlusion scenarios with challenging driving +maneuvers, like traffic violations, near-miss events, overtaking, and U-turns. +Through multiple experiments, we show that our CoopDet3D camera-LiDAR fusion +model achieves an increase of +14.36 3D mAP compared to a vehicle camera-LiDAR +fusion model. Finally, we make our dataset, model, labeling tool, and dev-kit +publicly available on our website: +https://tum-traffic-dataset.github.io/tumtraf-v2x.",cs.CV,['cs.CV'] +Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization,Guopeng Li · Ming Qian · Gui-Song Xia, ,https://arxiv.org/abs/2403.14198v1,,2403.14198v1.pdf,Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization,"This paper investigates the effective utilization of unlabeled data for +large-area cross-view geo-localization (CVGL), encompassing both unsupervised +and semi-supervised settings. Common approaches to CVGL rely on +ground-satellite image pairs and employ label-driven supervised training. +However, the cost of collecting precise cross-view image pairs hinders the +deployment of CVGL in real-life scenarios. Without the pairs, CVGL will be more +challenging to handle the significant imaging and spatial gaps between ground +and satellite images. To this end, we propose an unsupervised framework +including a cross-view projection to guide the model for retrieving initial +pseudo-labels and a fast re-ranking mechanism to refine the pseudo-labels by +leveraging the fact that ``the perfectly paired ground-satellite image is +located in a unique and identical scene"". The framework exhibits competitive +performance compared with supervised works on three open-source benchmarks. Our +code and models will be released on https://github.com/liguopeng0923/UCVGL.",cs.CV,['cs.CV'] +Deep Imbalanced Regression via Hierarchical Classification Adjustment,Haipeng Xiong · Angela Yao, ,https://arxiv.org/abs/2310.17154,,2310.17154.pdf,Deep Imbalanced Regression via Hierarchical Classification Adjustment,"Regression tasks in computer vision, such as age estimation or counting, are +often formulated into classification by quantizing the target space into +classes. Yet real-world data is often imbalanced -- the majority of training +samples lie in a head range of target values, while a minority of samples span +a usually larger tail range. By selecting the class quantization, one can +adjust imbalanced regression targets into balanced classification outputs, +though there are trade-offs in balancing classification accuracy and +quantization error. To improve regression performance over the entire range of +data, we propose to construct hierarchical classifiers for solving imbalanced +regression tasks. The fine-grained classifiers limit the quantization error +while being modulated by the coarse predictions to ensure high accuracy. +Standard hierarchical classification approaches, however, when applied to the +regression problem, fail to ensure that predicted ranges remain consistent +across the hierarchy. 
As such, we propose a range-preserving distillation +process that can effectively learn a single classifier from the set of +hierarchical classifiers. Our novel hierarchical classification adjustment +(HCA) for imbalanced regression shows superior results on three diverse tasks: +age estimation, crowd counting and depth estimation. We will release the source +code upon acceptance.",cs.CV,['cs.CV'] +Ensemble Diversity Facilitates Adversarial Transferability,Bowen Tang · Zheng Wang · Yi Bin · Qi Dou · Yang Yang · Heng Tao Shen, ,https://arxiv.org/abs/2403.16405,,2403.16405.pdf,Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models,"The integration of an ensemble of deep learning models has been extensively +explored to enhance defense against adversarial attacks. The diversity among +sub-models increases the attack cost required to deceive the majority of the +ensemble, thereby improving the adversarial robustness. While existing +approaches mainly center on increasing diversity in feature representations or +dispersion of first-order gradients with respect to input, the limited +correlation between these diversity metrics and adversarial robustness +constrains the performance of ensemble adversarial defense. In this work, we +aim to enhance ensemble diversity by reducing attack transferability. We +identify second-order gradients, which depict the loss curvature, as a key +factor in adversarial robustness. Computing the Hessian matrix involved in +second-order gradients is computationally expensive. To address this, we +approximate the Hessian-vector product using differential approximation. Given +that low curvature provides better robustness, our ensemble model was designed +to consider the influence of curvature among different sub-models. We introduce +a novel regularizer to train multiple more-diverse low-curvature network +models. Extensive experiments across various datasets demonstrate that our +ensemble model exhibits superior robustness against a range of attacks, +underscoring the effectiveness of our approach.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,Haonan Lin, ,https://arxiv.org/abs/2403.19235,,2403.19235.pdf,DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,"While large-scale pre-trained text-to-image models can synthesize diverse and +high-quality human-centered images, novel challenges arise with a nuanced task +of ""identity fine editing"": precisely modifying specific features of a subject +while maintaining its inherent identity and context. Existing personalization +methods either require time-consuming optimization or learning additional +encoders, adept in ""identity re-contextualization"". However, they often +struggle with detailed and sensitive tasks like human face editing. To address +these challenges, we introduce DreamSalon, a noise-guided, staged-editing +framework, uniquely focusing on detailed image manipulations and +identity-context preservation. By discerning editing and boosting stages via +the frequency and gradient of predicted noises, DreamSalon first performs +detailed manipulations on specific features in the editing stage, guided by +high-frequency information, and then employs stochastic denoising in the +boosting stage to improve image quality. 
For more precise editing, DreamSalon +semantically mixes source and target textual prompts, guided by differences in +their embedding covariances, to direct the model's focus on specific +manipulation areas. Our experiments demonstrate DreamSalon's ability to +efficiently and faithfully edit fine details on human faces, outperforming +existing methods both qualitatively and quantitatively.",cs.CV,['cs.CV'] +RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation,Zeyuan Yang · LIU JIAGENG · Peihao Chen · Anoop Cherian · Tim Marks · Jonathan Le Roux · Chuang Gan, ,,https://github.com/zchoi/Awesome-Embodied-Agent-with-LLMs,,,,,nan +FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,Adrian Bulat · Yassine Ouali · Georgios Tzimiropoulos, ,https://arxiv.org/abs/2405.10286,,2405.10286.pdf,FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,"Despite noise and caption quality having been acknowledged as important +factors impacting vision-language contrastive pre-training, in this paper, we +show that the full potential of improving the training process by addressing +such issues is yet to be realized. Specifically, we firstly study and analyze +two issues affecting training: incorrect assignment of negative pairs, and low +caption quality and diversity. Then, we devise effective solutions for +addressing both problems, which essentially require training with multiple true +positive pairs. Finally, we propose training with sigmoid loss to address such +a requirement. We show very large gains over the current state-of-the-art for +both image recognition ($\sim +6\%$ on average over 11 datasets) and image +retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).",cs.CV,"['cs.CV', 'cs.AI']" +Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning,Da-Wei Zhou · Hai-Long Sun · Han-Jia Ye · De-Chuan Zhan,https://github.com/sun-hailong/CVPR24-Ease,https://arxiv.org/abs/2403.12030v1,,2403.12030v1.pdf,Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning,"Class-Incremental Learning (CIL) requires a learning system to continually +learn new classes without forgetting. Despite the strong performance of +Pre-Trained Models (PTMs) in CIL, a critical issue persists: learning new +classes often results in the overwriting of old ones. Excessive modification of +the network causes forgetting, while minimal adjustments lead to an inadequate +fit for new classes. As a result, it is desired to figure out a way of +efficient model updating without harming former knowledge. In this paper, we +propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model +updating without conflict, we train a distinct lightweight adapter module for +each new task, aiming to create task-specific subspaces. These adapters span a +high-dimensional feature space, enabling joint decision-making across multiple +subspaces. As data evolves, the expanding subspaces render the old class +classifiers incompatible with new-stage spaces. Correspondingly, we design a +semantic-guided prototype complement strategy that synthesizes old classes' new +features without using any old class instance. Extensive experiments on seven +benchmark datasets verify EASE's state-of-the-art performance. 
Code is +available at: https://github.com/sun-hailong/CVPR24-Ease",cs.CV,"['cs.CV', 'cs.LG']" +Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline,Yu chen · Fei Gao · Yanguang Zhang · Maoying Qiao · Nannan Wang,https://github.com/AiArt-HDU/HMEG,,https://link.springer.com/chapter/10.1007/978-3-031-41676-7_9,,,,,nan +AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation,Jeongsoo Choi · Se Jin Park · Minsu Kim · Yong Man Ro, ,https://arxiv.org/html/2312.02512v2,,2312.02512v2.pdf,AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation,"This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech +Translation (AV2AV) framework, where the input and output of the system are +multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key +advantages can be brought: 1) We can perform real-like conversations with +individuals worldwide in a virtual meeting by utilizing our own primary +languages. In contrast to Speech-to-Speech Translation (A2A), which solely +translates between audio modalities, the proposed AV2AV directly translates +between audio-visual speech. This capability enhances the dialogue experience +by presenting synchronized lip movements along with the translated speech. 2) +We can improve the robustness of the spoken language translation system. By +employing the complementary information of audio-visual speech, the system can +effectively translate spoken language even in the presence of acoustic noise, +showcasing robust performance. To mitigate the problem of the absence of a +parallel AV2AV translation dataset, we propose to train our spoken language +translation system with the audio-only dataset of A2A. This is done by learning +unified audio-visual speech representations through self-supervised learning in +advance to train the translation system. Moreover, we propose an AV-Renderer +that can generate raw audio and video in parallel. It is designed with +zero-shot speaker modeling, thus the speaker in source audio-visual speech can +be maintained at the target translated audio-visual speech. The effectiveness +of AV2AV is evaluated with extensive experiments in a many-to-many language +translation setting. Demo page is available on +https://choijeongsoo.github.io/av2av.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'eess.AS']" +DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation,Zeeshan Hayder · Xuming He,https://zeeshanhayder.github.io/DSGG,https://arxiv.org/abs/2403.14886,,2403.14886.pdf,DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation,"Scene graph generation aims to capture detailed spatial and semantic +relationships between objects in an image, which is challenging due to +incomplete labelling, long-tailed relationship categories, and relational +semantic overlap. Existing Transformer-based methods either employ distinct +queries for objects and predicates or utilize holistic queries for relation +triplets and hence often suffer from limited capacity in learning low-frequency +relationships. In this paper, we present a new Transformer-based method, called +DSGG, that views scene graph detection as a direct graph prediction problem +based on a unique set of graph-aware queries. 
In particular, each graph-aware +query encodes a compact representation of both the node and all of its +relations in the graph, acquired through the utilization of a relaxed sub-graph +matching during the training process. Moreover, to address the problem of +relational semantic overlap, we utilize a strategy for relation distillation, +aiming to efficiently learn multiple instances of semantic relationships. +Extensive experiments on the VG and the PSG datasets show that our model +achieves state-of-the-art results, showing a significant improvement of 3.5\% +and 6.7\% in mR@50 and mR@100 for the scene-graph generation task and achieves +an even more substantial improvement of 8.5\% and 10.3\% in mR@50 and mR@100 +for the panoptic scene graph generation task. Code is available at +\url{https://github.com/zeeshanhayder/DSGG}.",cs.CV,['cs.CV'] +Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering,Suyuan Liu · KE LIANG · Zhibin Dong · Siwei Wang · Xihong Yang · sihang zhou · En Zhu · Xinwang Liu, ,https://arxiv.org/abs/2309.00024,,2309.00024.pdf,Efficient Multi-View Graph Clustering with Local and Global Structure Preservation,"Anchor-based multi-view graph clustering (AMVGC) has received abundant +attention owing to its high efficiency and the capability to capture +complementary structural information across multiple views. Intuitively, a +high-quality anchor graph plays an essential role in the success of AMVGC. +However, the existing AMVGC methods only consider single-structure information, +i.e., local or global structure, which provides insufficient information for +the learning task. To be specific, the over-scattered global structure leads to +learned anchors failing to depict the cluster partition well. In contrast, the +local structure with an improper similarity measure results in potentially +inaccurate anchor assignment, ultimately leading to sub-optimal clustering +performance. To tackle the issue, we propose a novel anchor-based multi-view +graph clustering framework termed Efficient Multi-View Graph Clustering with +Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified +framework with a theoretical guarantee is designed to capture local and global +information. Besides, EMVGC-LG jointly optimizes anchor construction and graph +learning to enhance the clustering quality. In addition, EMVGC-LG inherits the +linear complexity of existing AMVGC methods respecting the sample number, which +is time-economical and scales well with the data size. Extensive experiments +demonstrate the effectiveness and efficiency of our proposed method.",cs.LG,['cs.LG'] +SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation,Jiaben Chen · Huaizu Jiang, ,https://arxiv.org/abs/2308.16876v2,,2308.16876v2.pdf,SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation,"Human-centric video frame interpolation has great potential for improving +people's entertainment experiences and finding commercial applications in the +sports analysis industry, e.g., synthesizing slow-motion videos. Although there +are multiple benchmark datasets available in the community, none of them is +dedicated for human-centric scenarios. To bridge this gap, we introduce +SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video +frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from +YouTube. 
We re-train several state-of-the-art methods on our benchmark, and the +results show a decrease in their accuracy compared to other datasets. It +highlights the difficulty of our benchmark and suggests that it poses +significant challenges even for the best-performing methods, as human bodies +are highly deformable and occlusions are frequent in sports videos. To improve +the accuracy, we introduce two loss terms considering the human-aware priors, +where we add auxiliary supervision to panoptic segmentation and human keypoints +detection, respectively. The loss terms are model agnostic and can be easily +plugged into any video frame interpolation approaches. Experimental results +validate the effectiveness of our proposed loss terms, leading to consistent +performance improvement over 5 existing models, which establish strong baseline +models on our benchmark. The dataset and code can be found at: +https://neu-vi.github.io/SportsSlomo/.",cs.CV,['cs.CV'] +G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images,Zixiong Huang · Qi Chen · Libo Sun · Yifan Yang · Naizhou Wang · Qi Wu · Mingkui Tan, ,https://arxiv.org/abs/2404.07474,,2404.07474.pdf,G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images,"Novel view synthesis aims to generate new view images of a given view image +collection. Recent attempts address this problem relying on 3D geometry priors +(e.g., shapes, sizes, and positions) learned from multi-view images. However, +such methods encounter the following limitations: 1) they require a set of +multi-view images as training data for a specific scene (e.g., face, car or +chair), which is often unavailable in many real-world scenarios; 2) they fail +to extract the geometry priors from single-view images due to the lack of +multi-view supervision. In this paper, we propose a Geometry-enhanced NeRF +(G-NeRF), which seeks to enhance the geometry priors by a geometry-guided +multi-view synthesis approach, followed by a depth-aware training. In the +synthesis process, inspired that existing 3D GAN models can unconditionally +synthesize high-fidelity multi-view images, we seek to adopt off-the-shelf 3D +GAN models, such as EG3D, as a free source to provide geometry priors through +synthesizing multi-view data. Simultaneously, to further improve the geometry +quality of the synthetic data, we introduce a truncation method to effectively +sample latent codes within 3D GAN models. To tackle the absence of multi-view +supervision for single-view images, we design the depth-aware training +approach, incorporating a depth-aware discriminator to guide geometry priors +through depth maps. Experiments demonstrate the effectiveness of our method in +terms of both qualitative and quantitative results.",cs.CV,['cs.CV'] +MaskPLAN: Masked Generative Layout Planning from Partial Input,Hang Zhang · Anton Savov · Benjamin Dillenburger, ,https://arxiv.org/abs/2312.05039,,2312.05039.pdf,SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,"The field of generative image inpainting and object insertion has made +significant progress with the recent advent of latent diffusion models. +Utilizing a precise object mask can greatly enhance these applications. +However, due to the challenges users encounter in creating high-fidelity masks, +there is a tendency for these methods to rely on more coarse masks (e.g., +bounding box) for these applications. 
This results in limited control and +compromised background content preservation. To overcome these limitations, we +introduce SmartMask, which allows any novice user to create detailed masks for +precise object insertion. Combined with a ControlNet-Inpaint model, our +experiments demonstrate that SmartMask achieves superior object insertion +quality, preserving the background content more effectively than previous +methods. Notably, unlike prior works the proposed approach can also be used +even without user-mask guidance, which allows it to perform mask-free object +insertion at diverse positions and scales. Furthermore, we find that when used +iteratively with a novel instruction-tuning based planning model, SmartMask can +be used to design detailed layouts from scratch. As compared with user-scribble +based layout design, we observe that SmartMask allows for better quality +outputs with layout-to-image generation methods. Project page is available at +https://smartmask-gen.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG', 'cs.MM']" +OneLLM: One Framework to Align All Modalities with Language,Jiaming Han · Kaixiong Gong · Yiyuan Zhang · Jiaqi Wang · Kaipeng Zhang · Dahua Lin · Yu Qiao · Peng Gao · Xiangyu Yue, ,https://arxiv.org/abs/2312.03700,,2312.03700.pdf,OneLLM: One Framework to Align All Modalities with Language,"Multimodal large language models (MLLMs) have gained significant attention +due to their strong multimodal understanding capability. However, existing +works rely heavily on modality-specific encoders, which usually differ in +architecture and are limited to common modalities. In this paper, we present +OneLLM, an MLLM that aligns eight modalities to language using a unified +framework. We achieve this through a unified multimodal encoder and a +progressive multimodal alignment pipeline. In detail, we first train an image +projection module to connect a vision encoder with LLM. Then, we build a +universal projection module (UPM) by mixing multiple image projection modules +and dynamic routing. Finally, we progressively align more modalities to LLM +with the UPM. To fully leverage the potential of OneLLM in following +instructions, we also curated a comprehensive multimodal instruction dataset, +including 2M items from image, audio, video, point cloud, depth/normal map, IMU +and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, +encompassing tasks such as multimodal captioning, question answering and +reasoning, where it delivers excellent performance. Code, data, model and +online demo are available at https://github.com/csuhan/OneLLM",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.MM']" +Open-World Semantic Segmentation Including Class Similarity,Matteo Sodano · Federico Magistri · Lucas Nunes · Jens Behley · Cyrill Stachniss, ,https://arxiv.org/abs/2403.07532,,2403.07532.pdf,Open-World Semantic Segmentation Including Class Similarity,"Interpreting camera data is key for autonomously acting systems, such as +autonomous vehicles. Vision systems that operate in real-world environments +must be able to understand their surroundings and need the ability to deal with +novel situations. This paper tackles open-world semantic segmentation, i.e., +the variant of interpreting image data in which objects occur that have not +been seen during training. We propose a novel approach that performs accurate +closed-world semantic segmentation and, at the same time, can identify new +categories without requiring any additional training data. 
Our approach +additionally provides a similarity measure for every newly discovered class in +an image to a known category, which can be useful information in downstream +tasks such as planning or mapping. Through extensive experiments, we show that +our model achieves state-of-the-art results on classes known from training data +as well as for anomaly segmentation and can distinguish between different +unknown classes.",cs.CV,['cs.CV'] +MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,Enxin Song · Wenhao Chai · Guanhong Wang · Haoyang Zhou · Feiyang Wu · Yucheng Zhang · Tian Ye · Haozhe Chi · Xun Guo · Yanting Zhang · Yan Lu · Jenq-Neng Hwang · Gaoang Wang, ,https://arxiv.org/abs/2307.16449,,2307.16449.pdf,MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,"Recently, integrating video foundation models and large language models to +build a video understanding system can overcome the limitations of specific +pre-defined vision tasks. Yet, existing systems can only handle videos with +very few frames. For long videos, the computation complexity, memory cost, and +long-term temporal connection impose additional challenges. Taking advantage of +the Atkinson-Shiffrin memory model, with tokens in Transformers being employed +as the carriers of memory in combination with our specially designed memory +mechanism, we propose the MovieChat to overcome these challenges. MovieChat +achieves state-of-the-art performance in long video understanding, along with +the released MovieChat-1K benchmark with 1K long video and 14K manual +annotations for validation of the effectiveness of our method.",cs.CV,['cs.CV'] +Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,Haoning Wu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Annan Wang · Kaixin Xu · Chunyi Li · Jingwen Hou · Guangtao Zhai · Xue Geng · Wenxiu Sun · Qiong Yan · Weisi Lin,https://q-future.github.io/Q-Instruct,https://arxiv.org/abs/2311.06783,,2311.06783.pdf,Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,"Multi-modality foundation models, as represented by GPT-4V, have brought a +new paradigm for low-level visual perception and understanding tasks, that can +respond to a broad range of natural human instructions in a model. While +existing foundation models have shown exciting potentials on low-level visual +tasks, their related abilities are still preliminary and need to be improved. +In order to enhance these models, we conduct a large-scale subjective +experiment collecting a vast number of real human feedbacks on low-level +vision. Each feedback follows a pathway that starts with a detailed description +on the low-level visual appearance (*e.g. clarity, color, brightness* of an +image, and ends with an overall conclusion, with an average length of 45 words. +The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on +18,973 images with diverse low-level appearance. Moreover, to enable foundation +models to robustly respond to diverse types of questions, we design a +GPT-participated conversion to process these feedbacks into diverse-format 200K +instruction-response pairs. Experimental results indicate that the +**Q-Instruct** consistently elevates low-level perception and understanding +abilities across several foundational models. 
We anticipate that our datasets +can pave the way for a future that general intelligence can perceive, +understand low-level visual appearance and evaluate visual quality like a +human. Our dataset, model zoo, and demo is published at: +https://q-future.github.io/Q-Instruct.",cs.CV,"['cs.CV', 'cs.MM']" +WaveFace: Authentic Face Restoration with Efficient Frequency Recovery,Yunqi Miao · Jiankang Deng · Jungong Han,https://yoqim.github.io/waveface_page/,https://arxiv.org/abs/2403.12760,,2403.12760.pdf,WaveFace: Authentic Face Restoration with Efficient Frequency Recovery,"Although diffusion models are rising as a powerful solution for blind face +restoration, they are criticized for two problems: 1) slow training and +inference speed, and 2) failure in preserving identity and recovering +fine-grained facial details. In this work, we propose WaveFace to solve the +problems in the frequency domain, where low- and high-frequency components +decomposed by wavelet transformation are considered individually to maximize +authenticity as well as efficiency. The diffusion model is applied to recover +the low-frequency component only, which presents general information of the +original image but 1/16 in size. To preserve the original identity, the +generation is conditioned on the low-frequency component of low-quality images +at each denoising step. Meanwhile, high-frequency components at multiple +decomposition levels are handled by a unified network, which recovers complex +facial details in a single step. Evaluations on four benchmark datasets show +that: 1) WaveFace outperforms state-of-the-art methods in authenticity, +especially in terms of identity preservation, and 2) authentic images are +restored with the efficiency 10x faster than existing diffusion model-based BFR +methods.",cs.CV,['cs.CV'] +MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection,Boyang Peng · Sanqing Qu · Yong Wu · Tianpei Zou · Lianghua He · Alois Knoll · Guang Chen · Changjun Jiang,https://github.com/ispc-lab/MAP,https://arxiv.org/abs/2403.04149,,2403.04149.pdf,MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection,"Deep learning has achieved remarkable progress in various applications, +heightening the importance of safeguarding the intellectual property (IP) of +well-trained models. It entails not only authorizing usage but also ensuring +the deployment of models in authorized data domains, i.e., making models +exclusive to certain target domains. Previous methods necessitate concurrent +access to source training data and target unauthorized data when performing IP +protection, making them risky and inefficient for decentralized private data. +In this paper, we target a practical setting where only a well-trained source +model is available and investigate how we can realize IP protection. To achieve +this, we propose a novel MAsk Pruning (MAP) framework. MAP stems from an +intuitive hypothesis, i.e., there are target-related parameters in a +well-trained model, locating and pruning them is the key to IP protection. +Technically, MAP freezes the source model and learns a target-specific binary +mask to prevent unauthorized data usage while minimizing performance +degradation on authorized data. Moreover, we introduce a new metric aimed at +achieving a better balance between source and target performance degradation. 
+To verify the effectiveness and versatility, we have evaluated MAP in a variety +of scenarios, including vanilla source-available, practical source-free, and +challenging data-free. Extensive experiments indicate that MAP yields new +state-of-the-art performance.",cs.CV,['cs.CV'] +Unsegment Anything by Simulating Deformation,Jiahao Lu · Xingyi Yang · Xinchao Wang, ,https://arxiv.org/abs/2404.02585,,2404.02585.pdf,Unsegment Anything by Simulating Deformation,"Foundation segmentation models, while powerful, pose a significant risk: they +enable users to effortlessly extract any objects from any digital content with +a single click, potentially leading to copyright infringement or malicious +misuse. To mitigate this risk, we introduce a new task ""Anything Unsegmentable"" +to grant any image ""the right to be unsegmented"". The ambitious pursuit of the +task is to achieve highly transferable adversarial attacks against all +prompt-based segmentation models, regardless of model parameterizations and +prompts. We highlight the non-transferable and heterogeneous nature of +prompt-specific adversarial noises. Our approach focuses on disrupting image +encoder features to achieve prompt-agnostic attacks. Intriguingly, targeted +feature attacks exhibit better transferability compared to untargeted ones, +suggesting the optimal update direction aligns with the image manifold. Based +on the observations, we design a novel attack named Unsegment Anything by +Simulating Deformation (UAD). Our attack optimizes a differentiable deformation +function to create a target deformed image, which alters structural information +while preserving achievable feature distance by adversarial example. Extensive +experiments verify the effectiveness of our approach, compromising a variety of +promptable segmentation models with different architectures and prompt +interfaces. We release the code at +https://github.com/jiahaolu97/anything-unsegmentable.",cs.CV,['cs.CV'] +"Low-power, Continuous Remote Behavioral Localization with Event Cameras",Friedhelm Hamann · Suman Ghosh · Ignacio Juarez Martinez · Tom Hart · Alex Kacelnik · Guillermo Gallego,https://tub-rip.github.io/eventpenguins/,https://arxiv.org/abs/2312.03799,,2312.03799.pdf,"Low-power, Continuous Remote Behavioral Localization with Event Cameras","Researchers in natural science need reliable methods for quantifying animal +behavior. Recently, numerous computer vision methods emerged to automate the +process. However, observing wild species at remote locations remains a +challenging task due to difficult lighting conditions and constraints on power +supply and data storage. Event cameras offer unique advantages for +battery-dependent remote monitoring due to their low power consumption and high +dynamic range capabilities. We use this novel sensor to quantify a behavior in +Chinstrap penguins called ecstatic display. We formulate the problem as a +temporal action detection task, determining the start and end times of the +behavior. For this purpose, we recorded a colony of breeding penguins in +Antarctica for several weeks and labeled event data on 16 nests. The developed +method consists of a generator of candidate time intervals (proposals) and a +classifier of the actions within them. The experiments show that the event +cameras' natural response to motion is effective for continuous behavior +monitoring and detection, reaching a mean average precision (mAP) of 58% (which +increases to 63% in good weather conditions). 
The results also demonstrate the +robustness against various lighting conditions contained in the challenging +dataset. The low-power capabilities of the event camera allow it to record +significantly longer than with a conventional camera. This work pioneers the +use of event cameras for remote wildlife observation, opening new +interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/",cs.CV,"['cs.CV', 'cs.AI']" +Text-to-3D using Gaussian Splatting,Zilong Chen · Feng Wang · Yikai Wang · Huaping Liu,https://gsgen3d.github.io/,https://arxiv.org/abs/2309.16585,,2309.16585.pdf,Text-to-3D using Gaussian Splatting,"Automatic text-to-3D generation that combines Score Distillation Sampling +(SDS) with the optimization of volume rendering has achieved remarkable +progress in synthesizing realistic 3D objects. Yet most existing text-to-3D +methods by SDS and volume rendering suffer from inaccurate geometry, e.g., the +Janus issue, since it is hard to explicitly integrate 3D priors into implicit +3D representations. Besides, it is usually time-consuming for them to generate +elaborate 3D models with rich colors. In response, this paper proposes GSGEN, a +novel method that adopts Gaussian Splatting, a recent state-of-the-art +representation, to text-to-3D generation. GSGEN aims at generating high-quality +3D objects and addressing existing shortcomings by exploiting the explicit +nature of Gaussian Splatting that enables the incorporation of 3D prior. +Specifically, our method adopts a progressive optimization strategy, which +includes a geometry optimization stage and an appearance refinement stage. In +geometry optimization, a coarse representation is established under 3D point +cloud diffusion prior along with the ordinary 2D SDS optimization, ensuring a +sensible and 3D-consistent rough shape. Subsequently, the obtained Gaussians +undergo an iterative appearance refinement to enrich texture details. In this +stage, we increase the number of Gaussians by compactness-based densification +to enhance continuity and improve fidelity. With these designs, our approach +can generate 3D assets with delicate details and accurate geometry. Extensive +evaluations demonstrate the effectiveness of our method, especially for +capturing high-frequency components. Our code is available at +https://github.com/gsgen3d/gsgen",cs.CV,['cs.CV'] +UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory,Haiwen Diao · Bo Wan · Ying Zhang · Xu Jia · Huchuan Lu · Long Chen,https://github.com/Paranioar/UniPT,https://arxiv.org/abs/2308.14316v2,,2308.14316v2.pdf,UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory,"Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small +portion of parameters, is an effective strategy for adapting pre-trained models +to downstream domains. To further reduce the memory demand, recent PETL works +focus on the more valuable memory-efficient characteristic. In this paper, we +argue that the scalability, adaptability, and generalizability of +state-of-the-art methods are hindered by structural dependency and pertinency +on specific pre-trained backbones. To this end, we propose a new +memory-efficient PETL strategy, Universal Parallel Tuning (UniPT), to mitigate +these weaknesses. 
Specifically, we facilitate the transfer process via a +lightweight and learnable parallel network, which consists of: 1) A parallel +interaction module that decouples the sequential connections and processes the +intermediate activations detachedly from the pre-trained network. 2) A +confidence aggregation module that learns optimal strategies adaptively for +integrating cross-layer features. We evaluate UniPT with different backbones +(e.g., T5, VSE$\infty$, CLIP4Clip, Clip-ViL, and MDETR) on various +vision-and-language and pure NLP tasks. Extensive ablations on 18 datasets have +validated that UniPT can not only dramatically reduce memory consumption and +outperform the best competitor, but also achieve competitive performance over +other plain PETL methods with lower training memory overhead. Our code is +publicly available at: https://github.com/Paranioar/UniPT.",cs.CV,"['cs.CV', 'cs.MM']" +Single-View Refractive Index Tomography with Neural Fields,Brandon Zhao · Aviad Levis · Liam Connor · Pratul P. Srinivasan · Katherine Bouman, ,https://arxiv.org/abs/2309.04437,,2309.04437.pdf,Single View Refractive Index Tomography with Neural Fields,"Refractive Index Tomography is the inverse problem of reconstructing the +continuously-varying 3D refractive index in a scene using 2D projected image +measurements. Although a purely refractive field is not directly visible, it +bends light rays as they travel through space, thus providing a signal for +reconstruction. The effects of such fields appear in many scientific computer +vision settings, ranging from refraction due to transparent cells in microscopy +to the lensing of distant galaxies caused by dark matter in astrophysics. +Reconstructing these fields is particularly difficult due to the complex +nonlinear effects of the refractive field on observed images. Furthermore, +while standard 3D reconstruction and tomography settings typically have access +to observations of the scene from many viewpoints, many refractive index +tomography problem settings only have access to images observed from a single +viewpoint. We introduce a method that leverages prior knowledge of light +sources scattered throughout the refractive medium to help disambiguate the +single-view refractive index tomography problem. We differentiably trace curved +rays through a neural field representation of the refractive field, and +optimize its parameters to best reproduce the observed image. We demonstrate +the efficacy of our approach by reconstructing simulated refractive fields, +analyze the effects of light source distribution on the recovered field, and +test our method on a simulated dark matter mapping problem where we +successfully recover the 3D refractive field caused by a realistic dark matter +distribution.",cs.CV,"['cs.CV', 'astro-ph.CO']" +MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos,Jielin Qiu · Jiacheng Zhu · William Han · Aditesh Kumar · Karthik Mittal · Claire Jin · Zhengyuan Yang · Linjie Li · Jianfeng Wang · DING ZHAO · Bo Li · Lijuan Wang, ,https://arxiv.org/abs/2306.04216,,2306.04216.pdf,MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos,"Multimodal summarization with multimodal output (MSMO) has emerged as a +promising research direction. Nonetheless, numerous limitations exist within +existing public MSMO datasets, including insufficient maintenance, data +inaccessibility, limited size, and the absence of proper categorization, which +pose significant challenges. 
To address these challenges and provide a +comprehensive dataset for this new direction, we have meticulously curated the +\textbf{MMSum} dataset. Our new dataset features (1) Human-validated summaries +for both video and textual content, providing superior human instruction and +labels for multimodal learning. (2) Comprehensively and meticulously arranged +categorization, spanning 17 principal categories and 170 subcategories to +encapsulate a diverse array of real-world scenarios. (3) Benchmark tests +performed on the proposed dataset to assess various tasks and methods, +including \textit{video summarization}, \textit{text summarization}, and +\textit{multimodal summarization}. To champion accessibility and collaboration, +we will release the \textbf{MMSum} dataset and the data collection tool as +fully open-source resources, fostering transparency and accelerating future +developments. Our project website can be found +at~\url{https://mmsum-dataset.github.io/}",cs.CV,"['cs.CV', 'cs.MM']" +Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching,Shitong Shao · Zeyuan Yin · Muxin Zhou · Xindong Zhang · Zhiqiang Shen, ,https://arxiv.org/abs/2311.17950,,2311.17950.pdf,Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching,"The lightweight ""local-match-global"" matching introduced by SRe2L +successfully creates a distilled dataset with comprehensive information on the +full 224x224 ImageNet-1k. However, this one-sided approach is limited to a +particular backbone, layer, and statistics, which limits the improvement of the +generalization of a distilled dataset. We suggest that sufficient and various +""local-match-global"" matching are more precise and effective than a single one +and has the ability to create a distilled dataset with richer information and +better generalization. We call this perspective ""generalized matching"" and +propose Generalized Various Backbone and Statistical Matching (G-VBSM) in this +work, which aims to create a synthetic dataset with densities, ensuring +consistency with the complete dataset across various backbones, layers, and +statistics. As experimentally demonstrated, G-VBSM is the first algorithm to +obtain strong performance across both small-scale and large-scale datasets. +Specifically, G-VBSM achieves a performance of 38.7% on CIFAR-100 with +128-width ConvNet, 47.6% on Tiny-ImageNet with ResNet18, and 31.4% on the full +224x224 ImageNet-1k with ResNet18, under images per class (IPC) 10, 50, and 10, +respectively. These results surpass all SOTA methods by margins of 3.9%, 6.5%, +and 10.1%, respectively.",cs.CV,"['cs.CV', 'cs.AI']" +Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation,Haofeng Liu · Chenshu Xu · Yifei Yang · Lihua Zeng · Shengfeng He,https://github.com/haofengl/DragNoise,https://arxiv.org/abs/2404.01050,,2404.01050.pdf,Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation,"Point-based interactive editing serves as an essential tool to complement the +controllability of existing generative models. A concurrent work, +DragDiffusion, updates the diffusion latent map in response to user inputs, +causing global latent map alterations. This results in imprecise preservation +of the original content and unsuccessful editing due to gradient vanishing. In +contrast, we present DragNoise, offering robust and accelerated editing without +retracing the latent map. 
The core rationale of DragNoise lies in utilizing the +predicted noise output of each U-Net as a semantic editor. This approach is +grounded in two critical observations: firstly, the bottleneck features of +U-Net inherently possess semantically rich features ideal for interactive +editing; secondly, high-level semantics, established early in the denoising +process, show minimal variation in subsequent stages. Leveraging these +insights, DragNoise edits diffusion semantics in a single denoising step and +efficiently propagates these changes, ensuring stability and efficiency in +diffusion editing. Comparative experiments reveal that DragNoise achieves +superior control and semantic retention, reducing the optimization time by over +50% compared to DragDiffusion. Our codes are available at +https://github.com/haofengl/DragNoise.",cs.CV,"['cs.CV', 'cs.GR', 'cs.HC', 'cs.LG']" +CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images,Changsheng Chen · Liangwei Lin · Yongqi Chen · Bin Li · Jishen Zeng · Jiwu Huang,https://github.com/chenlewis/Chromaticity-Map-Adapter-for-DPAD,https://arxiv.org/abs/2404.06663,,2404.06663.pdf,Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement,"Document Presentation Attack Detection (DPAD) is an important measure in +protecting the authenticity of a document image. However, recent DPAD methods +demand additional resources, such as manual effort in collecting additional +data or knowing the parameters of acquisition devices. This work proposes a +DPAD method based on multi-modal disentangled traces (MMDT) without the above +drawbacks. We first disentangle the recaptured traces by a self-supervised +disentanglement and synthesis network to enhance the generalization capacity in +document images with different contents and layouts. Then, unlike the existing +DPAD approaches that rely only on data in the RGB domain, we propose to +explicitly employ the disentangled recaptured traces as new modalities in the +transformer backbone through adaptive multi-modal adapters to fuse RGB/trace +features efficiently. Visualization of the disentangled traces confirms the +effectiveness of the proposed method in different document contents. Extensive +experiments on three benchmark datasets demonstrate the superiority of our MMDT +method on representing forensic traces of recapturing distortion.",cs.CV,['cs.CV'] +Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super-Resolution,Hongjun Wang · Jiyuan Chen · Yinqiang Zheng · Tieyong Zeng, ,https://arxiv.org/abs/2402.18929,,2402.18929.pdf,Navigating Beyond Dropout: An Intriguing Solution Towards Generalizable Image Super Resolution,"Deep learning has led to a dramatic leap on Single Image Super-Resolution +(SISR) performances in recent years. %Despite the substantial advancement% +While most existing work assumes a simple and fixed degradation model (e.g., +bicubic downsampling), the research of Blind SR seeks to improve model +generalization ability with unknown degradation. Recently, Kong et al pioneer +the investigation of a more suitable training strategy for Blind SR using +Dropout. Although such method indeed brings substantial generalization +improvements via mitigating overfitting, we argue that Dropout simultaneously +introduces undesirable side-effect that compromises model's capacity to +faithfully reconstruct fine details. 
We show both the theoretical and +experimental analyses in our paper, and furthermore, we present another easy +yet effective training strategy that enhances the generalization ability of the +model by simply modulating its first and second-order features statistics. +Experimental results have shown that our method could serve as a model-agnostic +regularization and outperforms Dropout on seven benchmark datasets including +both synthetic and real-world scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement,Shiwei Jin · Zhen Wang · Lei Wang · Peng Liu · Ning Bi · Truong Nguyen, ,https://arxiv.org/abs/2404.05063,,2404.05063.pdf,AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement,"Facial action unit (AU) intensity plays a pivotal role in quantifying +fine-grained expression behaviors, which is an effective condition for facial +expression manipulation. However, publicly available datasets containing +intensity annotations for multiple AUs remain severely limited, often featuring +a restricted number of subjects. This limitation places challenges to the AU +intensity manipulation in images due to disentanglement issues, leading +researchers to resort to other large datasets with pretrained AU intensity +estimators for pseudo labels. In addressing this constraint and fully +leveraging manual annotations of AU intensities for precise manipulation, we +introduce AUEditNet. Our proposed model achieves impressive intensity +manipulation across 12 AUs, trained effectively with only 18 subjects. +Utilizing a dual-branch architecture, our approach achieves comprehensive +disentanglement of facial attributes and identity without necessitating +additional loss functions or implementing with large batch sizes. This approach +offers a potential solution to achieve desired facial attribute editing despite +the dataset's limited subject count. Our experiments demonstrate AUEditNet's +superior accuracy in editing AU intensities, affirming its capability in +disentangling facial attributes and identity within a limited subject pool. +AUEditNet allows conditioning by either intensity values or target images, +eliminating the need for constructing AU combinations for specific facial +expression synthesis. Moreover, AU intensity estimation, as a downstream task, +validates the consistency between real and edited images, confirming the +effectiveness of our proposed AU intensity manipulation method.",cs.CV,['cs.CV'] +Degree-of-Freedom Matters: Inferring Dynamics from Point Trajectories,Yan Zhang · Sergey Prokudin · Marko Mihajlovic · Qianli Ma · Siyu Tang, ,,https://www.nature.com/articles/s44172-024-00179-3,,,,,nan +Structure-Guided Adversarial Training of Diffusion Models,Ling Yang · Haotian Qian · Zhilong Zhang · Jingwei Liu · Bin CUI, ,https://arxiv.org/abs/2402.17563v1,,2402.17563v1.pdf,Structure-Guided Adversarial Training of Diffusion Models,"Diffusion models have demonstrated exceptional efficacy in various generative +applications. While existing models focus on minimizing a weighted sum of +denoising score matching losses for data distribution modeling, their training +primarily emphasizes instance-level optimization, overlooking valuable +structural information within each mini-batch, indicative of pair-wise +relationships among samples. To address this limitation, we introduce +Structure-guided Adversarial training of Diffusion Models (SADM). 
In this +pioneering approach, we compel the model to learn manifold structures between +samples in each training batch. To ensure the model captures authentic manifold +structures in the data distribution, we advocate adversarial training of the +diffusion generator against a novel structure discriminator in a minimax game, +distinguishing real manifold structures from the generated ones. SADM +substantially improves existing diffusion transformers (DiT) and outperforms +existing methods in image generation and cross-domain fine-tuning tasks across +12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on +ImageNet for class-conditional image generation at resolutions of 256x256 and +512x512, respectively.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,Chun Feng · Joy Hsu · Weiyu Liu · Jiajun Wu,https://chunfeng3364.github.io/projects/larc_website/project_page.html,https://arxiv.org/abs/2404.19696,,,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +SuperPrimitive: Scene Reconstruction at a Primitive Level,Kirill Mazur · Gwangbin Bae · Andrew J. Davison, ,https://arxiv.org/abs/2312.05889,,2312.05889.pdf,SuperPrimitive: Scene Reconstruction at a Primitive Level,"Joint camera pose and dense geometry estimation from a set of images or a +monocular video remains a challenging problem due to its computational +complexity and inherent visual ambiguities. Most dense incremental +reconstruction systems operate directly on image pixels and solve for their 3D +positions using multi-view geometry cues. Such pixel-level approaches suffer +from ambiguities or violations of multi-view consistency (e.g. caused by +textureless or specular surfaces). + We address this issue with a new image representation which we call a +SuperPrimitive. SuperPrimitives are obtained by splitting images into +semantically correlated local regions and enhancing them with estimated surface +normal directions, both of which are predicted by state-of-the-art single image +neural networks. 
This provides a local geometry estimate per SuperPrimitive, +while their relative positions are adjusted based on multi-view observations. + We demonstrate the versatility of our new representation by addressing three +3D reconstruction tasks: depth completion, few-view structure from motion, and +monocular dense visual odometry.",cs.CV,['cs.CV'] +Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,Chao Yi · Lu Ren · De-Chuan Zhan · Han-Jia Ye, ,https://arxiv.org/abs/2404.17753,,2404.17753.pdf,Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,"CLIP showcases exceptional cross-modal matching capabilities due to its +training on image-text contrastive learning tasks. However, without specific +optimization for unimodal scenarios, its performance in single-modality feature +extraction might be suboptimal. Despite this, some studies have directly used +CLIP's image encoder for tasks like few-shot classification, introducing a +misalignment between its pre-training objectives and feature extraction +methods. This inconsistency can diminish the quality of the image's feature +representation, adversely affecting CLIP's effectiveness in target tasks. In +this paper, we view text features as precise neighbors of image features in +CLIP's space and present a novel CrOss-moDal nEighbor Representation(CODER) +based on the distance structure between images and their neighbor texts. This +feature extraction method aligns better with CLIP's pre-training objectives, +thereby fully leveraging CLIP's robust cross-modal capabilities. The key to +construct a high-quality CODER lies in how to create a vast amount of +high-quality and diverse texts to match with images. We introduce the Auto Text +Generator(ATG) to automatically generate the required texts in a data-free and +training-free manner. We apply CODER to CLIP's zero-shot and few-shot image +classification tasks. Experiment results across various datasets and models +confirm CODER's effectiveness. Code is available +at:https://github.com/YCaigogogo/CVPR24-CODER.",cs.CV,"['cs.CV', 'cs.AI']" +MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections,mude hui · Zihao Wei · Hongru Zhu · Fei Xia · Yuyin Zhou,https://github.com/UCSC-VLAA/MicroDiffusion,https://arxiv.org/abs/2403.10815,,2403.10815.pdf,MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections,"Volumetric optical microscopy using non-diffracting beams enables rapid +imaging of 3D volumes by projecting them axially to 2D images but lacks crucial +depth information. Addressing this, we introduce MicroDiffusion, a pioneering +tool facilitating high-quality, depth-resolved 3D volume reconstruction from +limited 2D projections. While existing Implicit Neural Representation (INR) +models often yield incomplete outputs and Denoising Diffusion Probabilistic +Models (DDPM) excel at capturing details, our method integrates INR's +structural coherence with DDPM's fine-detail enhancement capabilities. We +pretrain an INR model to transform 2D axially-projected images into a +preliminary 3D volume. This pretrained INR acts as a global prior guiding +DDPM's generative process through a linear interpolation between INR outputs +and noise inputs. This strategy enriches the diffusion process with structured +3D information, enhancing detail and reducing noise in localized 2D images. 
By +conditioning the diffusion model on the closest 2D projection, MicroDiffusion +substantially enhances fidelity in resulting 3D reconstructions, surpassing INR +and standard DDPM outputs with unparalleled image quality and structural +fidelity. Our code and dataset are available at +https://github.com/UCSC-VLAA/MicroDiffusion.",eess.IV,"['eess.IV', 'cs.CV']" +Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data,Yu Deng · Duomin Wang · Xiaohang Ren · Xingyu Chen · Baoyuan Wang,https://github.com/YuDeng/Portrait-4D,https://arxiv.org/abs/2311.18729,,2311.18729.pdf,Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data,"Existing one-shot 4D head synthesis methods usually learn from monocular +videos with the aid of 3DMM reconstruction, yet the latter is evenly +challenging which restricts them from reasonable 4D head synthesis. We present +a method to learn one-shot 4D head synthesis via large-scale synthetic data. +The key is to first learn a part-wise 4D generative model from monocular images +via adversarial learning, to synthesize multi-view images of diverse identities +and full motions as training data; then leverage a transformer-based animatable +triplane reconstructor to learn 4D head reconstruction using the synthetic +data. A novel learning strategy is enforced to enhance the generalizability to +real images by disentangling the learning process of 3D reconstruction and +reenactment. Experiments demonstrate our superiority over the prior art.",cs.CV,['cs.CV'] +Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes,Hmrishav Bandyopadhyay · Subhadeep Koley · Ayan Das · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://hmrishavbandy.github.io/doodle23d/,https://arxiv.org/abs/2312.04043,,2312.04043.pdf,Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes,"In this paper, we democratise 3D content creation, enabling precise +generation of 3D shapes from abstract sketches while overcoming limitations +tied to drawing skills. We introduce a novel part-level modelling and alignment +framework that facilitates abstraction modelling and cross-modal +correspondence. Leveraging the same part-level decoder, our approach seamlessly +extends to sketch modelling by establishing correspondence between CLIPasso +edgemaps and projected 3D part regions, eliminating the need for a dataset +pairing human sketches and 3D shapes. Additionally, our method introduces a +seamless in-position editing process as a byproduct of cross-modal part-aligned +modelling. Operating in a low-dimensional implicit space, our approach +significantly reduces computational demands and processing time.",cs.CV,"['cs.CV', 'cs.AI']" +Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring,Huicong Zhang · Haozhe Xie · Hongxun Yao,https://vilab.hit.edu.cn/projects/bsstnet,,https://github.com/huicongzhang/BSSTNet,,,,,nan +Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications,Junyi Ma · Xieyuanli Chen · Jiawei Huang · Jingyi Xu · Zhen Luo · Jintao Xu · Weihao Gu · Rui Ai · Hesheng Wang,https://github.com/haomo-ai/Cam4DOcc,https://arxiv.org/abs/2311.17663,,2311.17663.pdf,Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications,"Understanding how the surrounding environment changes is crucial for +performing downstream tasks safely and reliably in autonomous driving +applications. 
Recent occupancy estimation techniques using only camera images +as input can provide dense occupancy representations of large-scale scenes +based on the current observation. However, they are mostly limited to +representing the current 3D space and do not consider the future state of +surrounding objects along the time axis. To extend camera-only occupancy +estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark +for camera-only 4D occupancy forecasting, evaluating the surrounding scene +changes in a near future. We build our benchmark based on multiple publicly +available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, +which provides sequential occupancy states of general movable and static +objects, as well as their 3D backward centripetal flow. To establish this +benchmark for future research with comprehensive comparisons, we introduce four +baseline types from diverse camera-based perception and prediction +implementations, including a static-world occupancy model, voxelization of +point cloud prediction, 2D-3D instance-based prediction, and our proposed novel +end-to-end 4D occupancy forecasting network. Furthermore, the standardized +evaluation protocol for preset multiple tasks is also provided to compare the +performance of all the proposed baselines on present and future occupancy +estimation with respect to objects of interest in autonomous driving scenarios. +The dataset and our implementation of all four baselines in the proposed +Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.",cs.CV,['cs.CV'] +DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations,Tianhao Qi · Shancheng Fang · Yanze Wu · Hongtao Xie · Jiawei Liu · Lang chen · Qian HE · Yongdong Zhang,https://tianhao-qi.github.io/DEADiff/,https://arxiv.org/abs/2403.06951,,2403.06951.pdf,DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations,"The diffusion-based text-to-image model harbors immense potential in +transferring reference style. However, current encoder-based approaches +significantly impair the text controllability of text-to-image models while +transferring styles. In this paper, we introduce DEADiff to address this issue +using the following two strategies: 1) a mechanism to decouple the style and +semantics of reference images. The decoupled feature representations are first +extracted by Q-Formers which are instructed by different text descriptions. +Then they are injected into mutually exclusive subsets of cross-attention +layers for better disentanglement. 2) A non-reconstructive learning method. The +Q-Formers are trained using paired images rather than the identical target, in +which the reference image and the ground-truth image are with the same style or +semantics. We show that DEADiff attains the best visual stylization results and +optimal balance between the text controllability inherent in the text-to-image +model and style similarity to the reference image, as demonstrated both +quantitatively and qualitatively. 
Our project page is +https://tianhao-qi.github.io/DEADiff/.",cs.CV,['cs.CV'] +What Sketch Explainability Really Means for Downstream Tasks ?,Hmrishav Bandyopadhyay · Pinaki Nath Chowdhury · Ayan Kumar Bhunia · Aneeshan Sain · Tao Xiang · Yi-Zhe Song, ,https://arxiv.org/abs/2403.09480,,2403.09480.pdf,What Sketch Explainability Really Means for Downstream Tasks,"In this paper, we explore the unique modality of sketch for explainability, +emphasising the profound impact of human strokes compared to conventional +pixel-oriented studies. Beyond explanations of network behavior, we discern the +genuine implications of explainability across diverse downstream sketch-related +tasks. We propose a lightweight and portable explainability solution -- a +seamless plugin that integrates effortlessly with any pre-trained model, +eliminating the need for re-training. Demonstrating its adaptability, we +present four applications: highly studied retrieval and generation, and +completely novel assisted drawing and sketch adversarial attacks. The +centrepiece to our solution is a stroke-level attribution map that takes +different forms when linked with downstream tasks. By addressing the inherent +non-differentiability of rasterisation, we enable explanations at both coarse +stroke level (SLA) and partial stroke level (P-SLA), each with its advantages +for specific downstream tasks.",cs.CV,"['cs.CV', 'cs.AI']" +OHTA: One-shot Hand Avatar via Data-driven Implicit Priors,Xiaozheng Zheng · Chao Wen · Zhuo Su · Zeran Xu · Zhaohu Li · Yang Zhao · Zhou Xue,https://zxz267.github.io/OHTA/,https://arxiv.org/abs/2402.18969,,2402.18969.pdf,OHTA: One-shot Hand Avatar via Data-driven Implicit Priors,"In this paper, we delve into the creation of one-shot hand avatars, attaining +high-fidelity and drivable hand representations swiftly from a single image. +With the burgeoning domains of the digital human, the need for quick and +personalized hand avatar creation has become increasingly critical. Existing +techniques typically require extensive input data and may prove cumbersome or +even impractical in certain scenarios. To enhance accessibility, we present a +novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed +hand avatars from merely one image. OHTA tackles the inherent difficulties of +this data-limited problem by learning and utilizing data-driven hand priors. +Specifically, we design a hand prior model initially employed for 1) learning +various hand priors with available data and subsequently for 2) the inversion +and fitting of the target identity with prior knowledge. OHTA demonstrates the +capability to create high-fidelity hand avatars with consistent animatable +quality, solely relying on a single image. Furthermore, we illustrate the +versatility of OHTA through diverse applications, encompassing text-to-avatar +conversion, hand editing, and identity latent space manipulation.",cs.CV,['cs.CV'] +FedUV: Uniformity and Variance for Heterogeneous Federated Learning,Ha Min Son · Moon-Hyun Kim · Tai-Myoung Chung · Chao Huang · Xin Liu,https://github.com/sonhamin/FedUV,https://arxiv.org/abs/2402.18372,,2402.18372.pdf,FedUV: Uniformity and Variance for Heterogeneous Federated Learning,"Federated learning is a promising framework to train neural networks with +widely distributed data. However, performance degrades heavily with +heterogeneously distributed data. 
Recent work has shown this is due to the +final layer of the network being most prone to local bias, some finding success +freezing the final layer as an orthogonal classifier. We investigate the +training dynamics of the classifier by applying SVD to the weights motivated by +the observation that freezing weights results in constant singular values. We +find that there are differences when training in IID and non-IID settings. +Based on this finding, we introduce two regularization terms for local training +to continuously emulate IID settings: (1) variance in the dimension-wise +probability distribution of the classifier and (2) hyperspherical uniformity of +representations of the encoder. These regularizations promote local models to +act as if it were in an IID setting regardless of the local data distribution, +thus offsetting proneness to bias while being flexible to the data. On +extensive experiments in both label-shift and feature-shift settings, we verify +that our method achieves highest performance by a large margin especially in +highly non-IID cases in addition to being scalable to larger models and +datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.DC']" +WinSyn: A High Resolution Testbed for Synthetic Data,Tom Kelly · John Femiani · Peter Wonka, ,https://arxiv.org/abs/2310.08471,,2310.08471.pdf,WinSyn: A High Resolution Testbed for Synthetic Data,"We present WinSyn, a unique dataset and testbed for creating high-quality +synthetic data with procedural modeling techniques. The dataset contains +high-resolution photographs of windows, selected from locations around the +world, with 89,318 individual window crops showcasing diverse geometric and +material characteristics. We evaluate a procedural model by training semantic +segmentation networks on both synthetic and real images and then comparing +their performances on a shared test set of real images. Specifically, we +measure the difference in mean Intersection over Union (mIoU) and determine the +effective number of real images to match synthetic data's training performance. +We design a baseline procedural model as a benchmark and provide 21,290 +synthetically generated images. By tuning the procedural model, key factors are +identified which significantly influence the model's fidelity in replicating +real-world scenarios. Importantly, we highlight the challenge of procedural +modeling using current techniques, especially in their ability to replicate the +spatial semantics of real-world scenarios. This insight is critical because of +the potential of procedural models to bridge to hidden scene aspects such as +depth, reflectivity, material properties, and lighting conditions.",cs.CV,"['cs.CV', 'cs.GR']" +Rethinking Inductive Biases for Surface Normal Estimation,Gwangbin Bae · Andrew J. Davison, ,https://arxiv.org/abs/2403.00712,,2403.00712.pdf,Rethinking Inductive Biases for Surface Normal Estimation,"Despite the growing demand for accurate surface normal estimation models, +existing methods use general-purpose dense prediction models, adopting the same +inductive biases as other tasks. In this paper, we discuss the inductive biases +needed for surface normal estimation and propose to (1) utilize the per-pixel +ray direction and (2) encode the relationship between neighboring surface +normals by learning their relative rotation. The proposed method can generate +crisp - yet, piecewise smooth - predictions for challenging in-the-wild images +of arbitrary resolution and aspect ratio. 
Compared to a recent ViT-based +state-of-the-art model, our method shows a stronger generalization ability, +despite being trained on an orders of magnitude smaller dataset. The code is +available at https://github.com/baegwangbin/DSINE.",cs.CV,['cs.CV'] +MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding,Xu Cao · Tong Zhou · Yunsheng Ma · Wenqian Ye · Can Cui · Kun Tang · Zhipeng Cao · Kaizhao Liang · Ziran Wang · James Rehg · chao zheng, ,,https://ysma.me/,,,,,nan +In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging,Xin Wang · Lizhi Wang · Xiangtian Ma · Maoqing Zhang · Lin Zhu · Hua Huang,https://github.com/2JONAS/In2SET,https://arxiv.org/abs/2312.13319,,2312.13319.pdf,In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging,"Dual-Camera Compressed Hyperspectral Imaging (DCCHI) offers the capability to +reconstruct 3D Hyperspectral Image (HSI) by fusing compressive and Panchromatic +(PAN) image, which has shown great potential for snapshot hyperspectral imaging +in practice. In this paper, we introduce a novel DCCHI reconstruction network, +the Intra-Inter Similarity Exploiting Transformer (In2SET). Our key insight is +to make full use of the PAN image to assist the reconstruction. To this end, we +propose using the intra-similarity within the PAN image as a proxy for +approximating the intra-similarity in the original HSI, thereby offering an +enhanced content prior for more accurate HSI reconstruction. Furthermore, we +aim to align the features from the underlying HSI with those of the PAN image, +maintaining semantic consistency and introducing new contextual information for +the reconstruction process. By integrating In2SET into a PAN-guided unrolling +framework, our method substantially enhances the spatial-spectral fidelity and +detail of the reconstructed images, providing a more comprehensive and accurate +depiction of the scene. Extensive experiments conducted on both real and +simulated datasets demonstrate that our approach consistently outperforms +existing state-of-the-art methods in terms of reconstruction quality and +computational complexity. Code will be released.",eess.IV,"['eess.IV', 'cs.CV']" +Describing Differences in Image Sets with Natural Language,Lisa Dunlap · Yuhui Zhang · Xiaohan Wang · Ruiqi Zhong · Trevor Darrell · Jacob Steinhardt · Joseph Gonzalez · Serena Yeung,https://understanding-visual-datasets.github.io/VisDiff-website/,https://arxiv.org/abs/2312.02974,,2312.02974.pdf,Describing Differences in Image Sets with Natural Language,"How do two sets of images differ? Discerning set-level differences is crucial +for understanding model behaviors and analyzing datasets, yet manually sifting +through thousands of images is impractical. To aid in this discovery process, +we explore the task of automatically describing the differences between two +$\textbf{sets}$ of images, which we term Set Difference Captioning. This task +takes in image sets $D_A$ and $D_B$, and outputs a description that is more +often true on $D_A$ than $D_B$. We outline a two-stage approach that first +proposes candidate difference descriptions from image sets and then re-ranks +the candidates by checking how well they can differentiate the two sets. We +introduce VisDiff, which first captions the images and prompts a language model +to propose candidate descriptions, then re-ranks these descriptions using CLIP. 
+To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image +sets with ground truth difference descriptions. We apply VisDiff to various +domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing +classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing +model failure modes (supervised ResNet), characterizing differences between +generative models (e.g., StableDiffusionV1 and V2), and discovering what makes +images memorable. Using VisDiff, we are able to find interesting and previously +unknown differences in datasets and models, demonstrating its utility in +revealing nuanced insights.",cs.CV,"['cs.CV', 'cs.CL', 'cs.CY', 'cs.LG']" +SketchINR: A First Look into Sketches as Implicit Neural Representations,Hmrishav Bandyopadhyay · Ayan Kumar Bhunia · Pinaki Nath Chowdhury · Aneeshan Sain · Tao Xiang · Timothy Hospedales · Yi-Zhe Song,https://hmrishavbandy.github.io/sketchinr,https://arxiv.org/abs/2403.09344,,2403.09344.pdf,SketchINR: A First Look into Sketches as Implicit Neural Representations,"We propose SketchINR, to advance the representation of vector sketches with +implicit neural models. A variable length vector sketch is compressed into a +latent space of fixed dimension that implicitly encodes the underlying shape as +a function of time and strokes. The learned function predicts the $xy$ point +coordinates in a sketch at each time and stroke. Despite its simplicity, +SketchINR outperforms existing representations at multiple tasks: (i) Encoding +an entire sketch dataset into a fixed size latent vector, SketchINR gives +$60\times$ and $10\times$ data compression over raster and vector sketches, +respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity +representation than other learned vector sketch representations, and is +uniquely able to scale to complex vector sketches such as FS-COCO. (iii) +SketchINR supports parallelisation that can decode/render $\sim$$100\times$ +faster than other learned vector representations such as SketchRNN. (iv) +SketchINR, for the first time, emulates the human ability to reproduce a sketch +with varying abstraction in terms of number and complexity of strokes. As a +first look at implicit sketches, SketchINR's compact high-fidelity +representation will support future work in modelling long and complex sketches.",cs.CV,"['cs.CV', 'cs.AI']" +Commonsense Prototype for Outdoor Unsupervised 3D Object Detection,Hai Wu · Shijia Zhao · Xun Huang · Chenglu Wen · Xin Li · Cheng Wang,https://github.com/hailanyi/CPD,https://arxiv.org/abs/2404.16493,,2404.16493.pdf,Commonsense Prototype for Outdoor Unsupervised 3D Object Detection,"The prevalent approaches of unsupervised 3D object detection follow +cluster-based pseudo-label generation and iterative self-training processes. +However, the challenge arises due to the sparsity of LiDAR scans, which leads +to pseudo-labels with erroneous size and position, resulting in subpar +detection performance. To tackle this problem, this paper introduces a +Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object +detection. CPD first constructs Commonsense Prototype (CProto) characterized by +high-quality bounding box and dense points, based on commonsense intuition. +Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size +prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely +scanned objects by the geometric knowledge from CProto. 
CPD outperforms +state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), +PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD +and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on +easy and moderate car classes, respectively. These achievements position CPD in +close proximity to fully supervised detectors, highlighting the significance of +our method. The code will be available at https://github.com/hailanyi/CPD.",cs.CV,['cs.CV'] +Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes,Xiaotian Sun · Qingshan Xu · Xinjie Yang · Yu Zang · Cheng Wang, ,https://arxiv.org/html/2404.00992v1,,2404.00992v1.pdf,SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance,"Neural Radiance Field (NeRF) technology has made significant strides in +creating novel viewpoints. However, its effectiveness is hampered when working +with sparsely available views, often leading to performance dips due to +overfitting. FreeNeRF attempts to overcome this limitation by integrating +implicit geometry regularization, which incrementally improves both geometry +and textures. Nonetheless, an initial low positional encoding bandwidth results +in the exclusion of high-frequency elements. The quest for a holistic approach +that simultaneously addresses overfitting and the preservation of +high-frequency details remains ongoing. This study introduces a novel feature +matching based sparse geometry regularization module. This module excels in +pinpointing high-frequency keypoints, thereby safeguarding the integrity of +fine details. Through progressive refinement of geometry and textures across +NeRF iterations, we unveil an effective few-shot neural rendering architecture, +designated as SGCNeRF, for enhanced novel view synthesis. Our experiments +demonstrate that SGCNeRF not only achieves superior geometry-consistent +outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB and 0.6 dB in +PSNR on the LLFF and DTU datasets, respectively.",cs.CV,['cs.CV'] +Segment Every Out-of-Distribution Object,Wenjie Zhao · Jia Li · Xin Dong · Yu Xiang · Yunhui Guo, ,https://arxiv.org/abs/2311.16516,,2311.16516.pdf,Segment Every Out-of-Distribution Object,"Semantic segmentation models, while effective for in-distribution categories, +face challenges in real-world deployment due to encountering +out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for +safety-critical applications. Existing methods rely on anomaly scores, but +choosing a suitable threshold for generating masks presents difficulties and +can lead to fragmentation and inaccuracy. This paper introduces a method to +convert anomaly \textbf{S}core \textbf{T}o segmentation \textbf{M}ask, called +S2M, a simple and effective framework for OoD detection in semantic +segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments +the entire OoD object. By transforming anomaly scores into prompts for a +promptable segmentation model, S2M eliminates the need for threshold selection. 
+Extensive experiments demonstrate that S2M outperforms the state-of-the-art by +approximately 20% in IoU and 40% in mean F1 score, on average, across various +benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly +datasets.",cs.CV,['cs.CV'] +Learning to Segment Referred Objects from Narrated Egocentric Videos,Yuhan Shen · Huiyu Wang · Xitong Yang · Matt Feiszli · Ehsan Elhamifar · Lorenzo Torresani · Effrosyni Mavroudi, ,https://arxiv.org/abs/2404.05206,,2404.05206.pdf,SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,"We propose a novel self-supervised embedding to learn how actions sound from +narrated in-the-wild egocentric videos. Whereas existing methods rely on +curated data with known audio-visual correspondence, our multimodal +contrastive-consensus coding (MC3) embedding reinforces the associations +between audio, language, and vision when all modality pairs agree, while +diminishing those associations when any one pair does not. We show our approach +can successfully discover how the long tail of human actions sound from +egocentric video, outperforming an array of recent multimodal embedding +techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal +tasks.",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Low-Resource Vision Challenges for Foundation Models,Yunhua Zhang · Hazel Doughty · Cees G. M. Snoek, ,https://arxiv.org/abs/2401.04716,,2401.04716.pdf,Low-Resource Vision Challenges for Foundation Models,"Low-resource settings are well-established in natural language processing, +where many languages lack sufficient data for deep learning at scale. However, +low-resource problems are under-explored in computer vision. In this paper, we +address this gap and explore the challenges of low-resource image tasks with +vision foundation models. We first collect a benchmark of genuinely +low-resource image data, covering historic maps, circuit diagrams, and +mechanical drawings. These low-resource settings all share three challenges: +data scarcity, fine-grained differences, and the distribution shift from +natural images to the specialized domain of interest. While existing foundation +models have shown impressive generalizability, we find they cannot transfer +well to our low-resource tasks. To begin to tackle the challenges of +low-resource vision, we introduce one simple baseline per challenge. +Specifically, we i) enlarge the data space by generative models, ii) adopt the +best sub-kernels to encode local regions for fine-grained difference discovery +and iii) learn attention for specialized domains. Experiments on our three +low-resource tasks demonstrate our proposals already provide a better baseline +than transfer learning, data augmentation, and fine-grained methods. This +highlights the unique characteristics and challenges of low-resource vision for +foundation models that warrant further investigation. Project page: +https://xiaobai1217.github.io/Low-Resource-Vision/.",cs.CV,['cs.CV'] +SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,Dave Zhenyu Chen · Haoxuan Li · Hsin-Ying Lee · Sergey Tulyakov · Matthias Nießner,https://daveredrum.github.io/SceneTex/,https://arxiv.org/abs/2311.17261,,2311.17261.pdf,SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,"We propose SceneTex, a novel method for effectively generating high-quality +and style-consistent textures for indoor scenes using depth-to-image diffusion +priors. 
Unlike previous methods that either iteratively warp 2D views onto a +mesh surface or distillate diffusion latent features without accurate geometric +and style cues, SceneTex formulates the texture synthesis task as an +optimization problem in the RGB space where style and geometry consistency are +properly reflected. At its core, SceneTex proposes a multiresolution texture +field to implicitly encode the mesh appearance. We optimize the target texture +via a score-distillation-based objective function in respective RGB renderings. +To further secure the style consistency across views, we introduce a +cross-attention decoder to predict the RGB values by cross-attending to the +pre-sampled reference locations in each instance. SceneTex enables various and +accurate texture synthesis for 3D-FRONT scenes, demonstrating significant +improvements in visual quality and prompt fidelity over the prior texture +generation methods.",cs.CV,['cs.CV'] +TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model,Hantao Yao · Rui Zhang · Changsheng Xu, ,https://arxiv.org/abs/2311.18231,,2311.18231.pdf,TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model,"Prompt tuning represents a valuable technique for adapting pre-trained +visual-language models (VLM) to various downstream tasks. Recent advancements +in CoOp-based methods propose a set of learnable domain-shared or +image-conditional textual tokens to facilitate the generation of task-specific +textual classifiers. However, those textual tokens have a limited +generalization ability regarding unseen domains, as they cannot dynamically +adjust to the distribution of testing classes. To tackle this issue, we present +a novel Textual-based Class-aware Prompt tuning(TCP) that explicitly +incorporates prior knowledge about classes to enhance their discriminability. +The critical concept of TCP involves leveraging Textual Knowledge Embedding +(TKE) to map the high generalizability of class-level textual knowledge into +class-aware textual tokens. By seamlessly integrating these class-aware prompts +into the Text Encoder, a dynamic class-aware classifier is generated to enhance +discriminability for unseen domains. During inference, TKE dynamically +generates class-aware prompts related to the unseen classes. Comprehensive +evaluations demonstrate that TKE serves as a plug-and-play module effortlessly +combinable with existing methods. Furthermore, TCP consistently achieves +superior performance while demanding less training time. +Code:https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/",cs.CV,['cs.CV'] +URHand: Universal Relightable Hands,Zhaoxi Chen · Gyeongsik Moon · Kaiwen Guo · Chen Cao · Stanislav Pidhorskyi · Tomas Simon · Rohan Joshi · Yuan Dong · Yichen Xu · Bernardo Pires · He Wen · Lucas Evans · Bo Peng · Julia Buffalini · Autumn Trimble · Kevyn McPhail · Melissa Schoeller · Shoou-I Yu · Javier Romero · Michael Zollhoefer · Yaser Sheikh · Ziwei Liu · Shunsuke Saito,https://frozenburning.github.io/projects/urhand/,http://export.arxiv.org/abs/2401.05334,,2401.05334.pdf,URHand: Universal Relightable Hands,"Existing photorealistic relightable hand models require extensive +identity-specific observations in different views, poses, and illuminations, +and face challenges in generalizing to natural illuminations and novel +identities. To bridge this gap, we present URHand, the first universal +relightable hand model that generalizes across viewpoints, poses, +illuminations, and identities. 
Our model allows few-shot personalization using +images captured with a mobile phone, and is ready to be photorealistically +rendered under novel illuminations. To simplify the personalization process +while retaining photorealism, we build a powerful universal relightable prior +based on neural relighting from multi-view images of hands captured in a light +stage with hundreds of identities. The key challenge is scaling the +cross-identity training while maintaining personalized fidelity and sharp +details without compromising generalization under natural illuminations. To +this end, we propose a spatially varying linear lighting model as the neural +renderer that takes physics-inspired shading as input feature. By removing +non-linear activations and bias, our specifically designed lighting model +explicitly keeps the linearity of light transport. This enables single-stage +training from light-stage data while generalizing to real-time rendering under +arbitrary continuous illuminations across diverse identities. In addition, we +introduce the joint learning of a physically based model and our neural +relighting model, which further improves fidelity and generalization. Extensive +experiments show that our approach achieves superior performance over existing +methods in terms of both quality and generalizability. We also demonstrate +quick personalization of URHand from a short phone scan of an unseen identity.",cs.CV,"['cs.CV', 'cs.GR']" +EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams,Christen Millerdurai · Hiroyasu Akada · Jian Wang · Diogo Luvizon · Christian Theobalt · Vladislav Golyanik,https://4dqv.mpi-inf.mpg.de/EventEgo3D/,https://arxiv.org/abs/2404.08640,,2404.08640.pdf,EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams,"Monocular egocentric 3D human motion capture is a challenging and actively +researched problem. Existing methods use synchronously operating visual sensors +(e.g. RGB cameras) and often fail under low lighting and fast motions, which +can be restricting in many applications involving head-mounted devices. In +response to the existing limitations, this paper 1) introduces a new problem, +i.e., 3D human motion capture from an egocentric monocular event camera with a +fisheye lens, and 2) proposes the first approach to it called EventEgo3D +(EE3D). Event streams have high temporal resolution and provide reliable cues +for 3D human motion capture under high-speed human motions and rapidly changing +illumination. The proposed EE3D framework is specifically tailored for learning +with event streams in the LNES representation, enabling high 3D reconstruction +accuracy. We also design a prototype of a mobile head-mounted device with an +event camera and record a real dataset with event observations and the +ground-truth 3D human poses (in addition to the synthetic dataset). 
Our EE3D +demonstrates robustness and superior 3D accuracy compared to existing solutions +across various challenging experiments while supporting real-time 3D pose +update rates of 140Hz.",cs.CV,['cs.CV'] +WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights,Youngdong Jang · Dong In Lee · MinHyuk Jang · Jong Wook Kim · Feng Yang · Sangpil Kim,https://kuai-lab.github.io/cvpr2024waterf/,https://arxiv.org/abs/2405.02066,,2405.02066.pdf,WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights,"The advances in the Neural Radiance Fields (NeRF) research offer extensive +applications in diverse domains, but protecting their copyrights has not yet +been researched in depth. Recently, NeRF watermarking has been considered one +of the pivotal solutions for safely deploying NeRF-based 3D representations. +However, existing methods are designed to apply only to implicit or explicit +NeRF representations. In this work, we introduce an innovative watermarking +method that can be employed in both representations of NeRF. This is achieved +by fine-tuning NeRF to embed binary messages in the rendering process. In +detail, we propose utilizing the discrete wavelet transform in the NeRF space +for watermarking. Furthermore, we adopt a deferred back-propagation technique +and introduce a combination with the patch-wise loss to improve rendering +quality and bit accuracy with minimum trade-offs. We evaluate our method in +three different aspects: capacity, invisibility, and robustness of the embedded +watermarks in the 2D-rendered images. Our method achieves state-of-the-art +performance with faster training speed over the compared state-of-the-art +methods.",cs.CV,"['cs.CV', 'eess.IV']" +ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association,Shuxiao Ding · Lukas Schneider · Marius Cordts · Jürgen Gall,https://github.com/dsx0511/ADA-Track,https://arxiv.org/abs/2405.08909,,2405.08909.pdf,ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association,"Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the +tracking-by-attention paradigm, utilizing track queries for identity-consistent +detection and object queries for identity-agnostic track spawning. +Tracking-by-attention, however, entangles detection and tracking queries in one +embedding for both the detection and tracking task, which is sub-optimal. Other +approaches resemble the tracking-by-detection paradigm, detecting objects using +decoupled track and detection queries followed by a subsequent association. +These methods, however, do not leverage synergies between the detection and +association task. Combining the strengths of both paradigms, we introduce +ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We +introduce a learnable data association module based on edge-augmented +cross-attention, leveraging appearance and geometric features. Furthermore, we +integrate this association module into the decoder layer of a DETR-based 3D +detector, enabling simultaneous DETR-like query-to-image cross-attention for +detection and query-to-query cross-attention for data association. By stacking +these decoder layers, queries are refined for the detection and association +task alternately, effectively harnessing the task dependencies. We evaluate our +method on the nuScenes dataset and demonstrate the advantage of our approach +compared to the two previous paradigms. 
Code is available at +https://github.com/dsx0511/ADA-Track.",cs.CV,['cs.CV'] +Scale Decoupled Distillation,Shicai Wei · Chunbo Luo · Yang Luo, ,https://arxiv.org/abs/2403.13512,,2403.13512.pdf,Scale Decoupled Distillation,"Logit knowledge distillation attracts increasing attention due to its +practicality in recent studies. However, it often suffers inferior performance +compared to the feature knowledge distillation. In this paper, we argue that +existing logit-based methods may be sub-optimal since they only leverage the +global logit output that couples multiple semantic knowledge. This may transfer +ambiguous knowledge to the student and mislead its learning. To this end, we +propose a simple but effective method, i.e., Scale Decoupled Distillation +(SDD), for logit knowledge distillation. SDD decouples the global logit output +into multiple local logit outputs and establishes distillation pipelines for +them. This helps the student to mine and inherit fine-grained and unambiguous +logit knowledge. Moreover, the decoupled knowledge can be further divided into +consistent and complementary logit knowledge that transfers the semantic +information and sample ambiguity, respectively. By increasing the weight of +complementary parts, SDD can guide the student to focus more on ambiguous +samples, improving its discrimination ability. Extensive experiments on several +benchmark datasets demonstrate the effectiveness of SDD for wide +teacher-student pairs, especially in the fine-grained classification task. Code +is available at: https://github.com/shicaiwei123/SDD-CVPR2024",cs.CV,"['cs.CV', 'cs.AI']" +SIRA: Scalable Inter-frame Relation and Association for Radar Perception,Ryoma Yataka · Pu (Perry) Wang · Petros Boufounos · Ryuhei Takahashi, ,,https://www.semanticscholar.org/paper/Radar-Perception-with-Scalable-Connective-Temporal-Yataka-Wang/78d83560c7e2aee39d8153bafc815482dcbd163e,,,,,nan +Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,Rui Li · Tobias Fischer · Mattia Segu · Marc Pollefeys · Luc Van Gool · Federico Tombari, ,https://arxiv.org/abs/2404.03658,,2404.03658.pdf,Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,"Recovering the 3D scene geometry from a single view is a fundamental yet +ill-posed problem in computer vision. While classical depth estimation methods +infer only a 2.5D scene representation limited to the image plane, recent +approaches based on radiance fields reconstruct a full 3D representation. +However, these methods still struggle with occluded regions since inferring +geometry without visual observation requires (i) semantic knowledge of the +surroundings, and (ii) reasoning about spatial context. We propose KYN, a novel +method for single-view scene reconstruction that reasons about semantic and +spatial context to predict each point's density. We introduce a vision-language +modulation module to enrich point features with fine-grained semantic +information. We aggregate point representations across the scene through a +language-guided spatial attention mechanism to yield per-point density +predictions aware of the 3D semantic context. We show that KYN improves 3D +shape recovery compared to predicting density for each 3D point in isolation. +We achieve state-of-the-art results in scene and object reconstruction on +KITTI-360, and show improved zero-shot generalization compared to prior work. 
+Project page: https://ruili3.github.io/kyn.",cs.CV,['cs.CV'] +Rich Human Feedback for Text-to-Image Generation,Youwei Liang · Junfeng He · Gang Li · Peizhao Li · Arseniy Klimovskiy · Nicholas Carolan · Jiao Sun · Jordi Pont-Tuset · Sarah Young · Feng Yang · Junjie Ke · Krishnamurthy Dvijotham · Katherine Collins · Yiwen Luo · Yang Li · Kai Kohlhoff · Deepak Ramachandran · Vidhya Navalpakkam, ,https://arxiv.org/abs/2312.10240,,2312.10240.pdf,Rich Human Feedback for Text-to-Image Generation,"Recent Text-to-Image (T2I) generation models such as Stable Diffusion and +Imagen have made significant progress in generating high-resolution images +based on text descriptions. However, many generated images still suffer from +issues such as artifacts/implausibility, misalignment with text descriptions, +and low aesthetic quality. Inspired by the success of Reinforcement Learning +with Human Feedback (RLHF) for large language models, prior works collected +human-provided scores as feedback on generated images and trained a reward +model to improve the T2I generation. In this paper, we enrich the feedback +signal by (i) marking image regions that are implausible or misaligned with the +text, and (ii) annotating which words in the text prompt are misrepresented or +missing on the image. We collect such rich human feedback on 18K generated +images (RichHF-18K) and train a multimodal transformer to predict the rich +feedback automatically. We show that the predicted rich human feedback can be +leveraged to improve image generation, for example, by selecting high-quality +training data to finetune and improve the generative models, or by creating +masks with predicted heatmaps to inpaint the problematic regions. Notably, the +improvements generalize to models (Muse) beyond those used to generate the +images on which human feedback data were collected (Stable Diffusion variants). +The RichHF-18K data set will be released in our GitHub repository: +https://github.com/google-research/google-research/tree/master/richhf_18k.",cs.CV,['cs.CV'] +"AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond",Zixiang Zhou · Yu Wan · Baoyuan Wang,https://github.com/zixiangzhou916/AvatarGPT,,https://www.semanticscholar.org/paper/AvatarGPT:-All-in-One-Framework-for-Motion-and-Zhou-Wan/b4e6f30ab07666dc7d485b24f072f2533609545c/figure/4,,,,,nan +SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection,Gang Zhang · Chen Junnan · Guohuan Gao · Jianmin Li · Si Liu · Xiaolin Hu, ,https://arxiv.org/abs/2403.05817,,2403.05817.pdf,SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection,"LiDAR-based 3D object detection plays an essential role in autonomous +driving. Existing high-performing 3D object detectors usually build dense +feature maps in the backbone network and prediction head. However, the +computational costs introduced by the dense feature maps grow quadratically as +the perception range increases, making these models hard to scale up to +long-range detection. Some recent works have attempted to construct fully +sparse detectors to solve this issue; nevertheless, the resulting models either +rely on a complex multi-stage pipeline or exhibit inferior performance. In this +work, we propose SAFDNet, a straightforward yet highly effective architecture, +tailored for fully sparse 3D object detection. In SAFDNet, an adaptive feature +diffusion strategy is designed to address the center feature missing problem. 
+We conducted extensive experiments on Waymo Open, nuScenes, and Argoverse2 +datasets. SAFDNet performed slightly better than the previous SOTA on the first +two datasets but much better on the last dataset, which features long-range +detection, verifying the efficacy of SAFDNet in scenarios where long-range +detection is required. Notably, on Argoverse2, SAFDNet surpassed the previous +best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster, and yielded +2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x +faster. The code will be available at https://github.com/zhanggang001/HEDNet.",cs.CV,['cs.CV'] +Neural Clustering based Visual Representation Learning,Guikun Chen · Xia Li · Yi Yang · Wenguan Wang,https://github.com/guikunchen/FEC,https://arxiv.org/abs/2403.17409,,2403.17409.pdf,Neural Clustering based Visual Representation Learning,"We investigate a fundamental aspect of machine vision: the measurement of +features, by revisiting clustering, one of the most classic approaches in +machine learning and data analysis. Existing visual feature extractors, +including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. +Though prevalent, such a grid-style paradigm is built upon engineering practice +and lacks explicit modeling of data distribution. In this work, we propose +feature extraction with clustering (FEC), a conceptually elegant yet +surprisingly ad-hoc interpretable neural clustering framework, which views +feature extraction as a process of selecting representatives from data and thus +automatically captures the underlying data distribution. Given an image, FEC +alternates between grouping pixels into individual clusters to abstract +representatives and updating the deep features of pixels with current +representatives. Such an iterative working mechanism is implemented in the form +of several neural layers and the final representatives can be used for +downstream tasks. The cluster assignments across layers, which can be viewed +and inspected by humans, make the forward process of FEC fully transparent and +empower it with promising ad-hoc interpretability. Extensive experiments on +various visual recognition models and tasks verify the effectiveness, +generality, and interpretability of FEC. We expect this work will provoke a +rethink of the current de facto grid-style paradigm.",cs.CV,['cs.CV'] +Neural Redshift: Random Networks are not Random Functions,Damien Teney · Armand Nicolicioiu · Valentin Hartmann · Ehsan Abbasnejad, ,https://arxiv.org/abs/2403.02241,,2403.02241.pdf,Neural Redshift: Random Networks are not Random Functions,"Our understanding of the generalization capabilities of neural networks (NNs) +is still incomplete. Prevailing explanations are based on implicit biases of +gradient descent (GD) but they cannot account for the capabilities of models +from gradient-free methods nor the simplicity bias recently observed in +untrained networks. This paper seeks other sources of generalization in NNs. + Findings. To understand the inductive biases provided by architectures +independently from GD, we examine untrained, random-weight networks. Even +simple MLPs show strong inductive biases: uniform sampling in weight space +yields a very biased distribution of functions in terms of complexity. But +unlike common wisdom, NNs do not have an inherent ""simplicity bias"". This +property depends on components such as ReLUs, residual connections, and layer +normalizations. 
Alternative architectures can be built with a bias for any +level of complexity. Transformers also inherit all these properties from their +building blocks. + Implications. We provide a fresh explanation for the success of deep learning +independent from gradient-based training. It points at promising avenues for +controlling the solutions implemented by trained models.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens,Zhiwen Chen · Zhiyu Zhu · Yifan Zhang · Junhui Hou · Guangming Shi · Jinjian Wu,https://github.com/happychenpipi/EventSAM/,https://arxiv.org/abs/2312.16222,,2312.16222.pdf,Segment Any Events via Weighted Adaptation of Pivotal Tokens,"In this paper, we delve into the nuanced challenge of tailoring the Segment +Anything Models (SAMs) for integration with event data, with the overarching +objective of attaining robust and universal object segmentation within the +event-centric domain. One pivotal issue at the heart of this endeavor is the +precise alignment and calibration of embeddings derived from event-centric data +such that they harmoniously coincide with those originating from RGB imagery. +Capitalizing on the vast repositories of datasets with paired events and RGB +images, our proposition is to harness and extrapolate the profound knowledge +encapsulated within the pre-trained SAM framework. As a cornerstone to +achieving this, we introduce a multi-scale feature distillation methodology. +This methodology rigorously optimizes the alignment of token embeddings +originating from event data with their RGB image counterparts, thereby +preserving and enhancing the robustness of the overall architecture. +Considering the distinct significance that token embeddings from intermediate +layers hold for higher-level embeddings, our strategy is centered on accurately +calibrating the pivotal token embeddings. This targeted calibration is aimed at +effectively managing the discrepancies in high-level embeddings originating +from both the event and image domains. Extensive experiments on different +datasets demonstrate the effectiveness of the proposed distillation method. +Code in http://github.com/happychenpipi/EventSAM.",cs.CV,['cs.CV'] +Continual Forgetting for Pre-trained Vision Models,Hongbo Zhao · Bolin Ni · Junsong Fan · Yuxi Wang · Yuntao Chen · Gaofeng Meng · Zhaoxiang Zhang,https://github.com/bjzhb666/GS-LoRA,https://arxiv.org/abs/2403.11530,,2403.11530.pdf,Continual Forgetting for Pre-trained Vision Models,"For privacy and security concerns, the need to erase unwanted information +from pre-trained vision models is becoming evident nowadays. In real-world +scenarios, erasure requests originate at any time from both users and model +owners. These requests usually form a sequence. Therefore, under such a +setting, selective information is expected to be continuously removed from a +pre-trained model while maintaining the rest. We define this problem as +continual forgetting and identify two key challenges. (i) For unwanted +knowledge, efficient and effective deleting is crucial. (ii) For remaining +knowledge, the impact brought by the forgetting procedure should be minimal. To +address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards +(i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for +each forgetting task independently, and towards (ii), a simple group sparse +regularization is adopted, enabling automatic selection of specific LoRA groups +and zeroing out the others. 
GS-LoRA is effective, parameter-efficient, +data-efficient, and easy to implement. We conduct extensive experiments on face +recognition, object detection and image classification and demonstrate that +GS-LoRA manages to forget specific classes with minimal impact on other +classes. Codes will be released on \url{https://github.com/bjzhb666/GS-LoRA}.",cs.CV,['cs.CV'] +Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling,Jianan Fan · Dongnan Liu · Hang Chang · Heng Huang · Mei Chen · Weidong Cai, ,https://arxiv.org/abs/2403.01053,,2403.01053.pdf,Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling,"Machine learning holds tremendous promise for transforming the fundamental +practice of scientific discovery by virtue of its data-driven nature. With the +ever-increasing stream of research data collection, it would be appealing to +autonomously explore patterns and insights from observational data for +discovering novel classes of phenotypes and concepts. However, in the +biomedical domain, there are several challenges inherently presented in the +cumulated data which hamper the progress of novel class discovery. The +non-i.i.d. data distribution accompanied by the severe imbalance among +different groups of classes essentially leads to ambiguous and biased semantic +representations. In this work, we present a geometry-constrained probabilistic +modeling treatment to resolve the identified issues. First, we propose to +parameterize the approximated posterior of instance embedding as a marginal von +MisesFisher distribution to account for the interference of distributional +latent bias. Then, we incorporate a suite of critical geometric properties to +impose proper constraints on the layout of constructed embedding space, which +in turn minimizes the uncontrollable risk for unknown class learning and +structuring. Furthermore, a spectral graph-theoretic method is devised to +estimate the number of potential novel classes. It inherits two intriguing +merits compared to existent approaches, namely high computational efficiency +and flexibility for taxonomy-adaptive estimation. Extensive experiments across +various biomedical scenarios substantiate the effectiveness and general +applicability of our method.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection,Heng Zhang · Qiuyu Zhao · Linyu Zheng · Hao Zeng · Zhiwei Ge · Tianhao Li · Sulong Xu, ,https://arxiv.org/abs/2310.16667,,2310.16667.pdf,CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection,"Deriving reliable region-word alignment from image-text pairs is critical to +learn object-level vision-language representations for open-vocabulary object +detection. Existing methods typically rely on pre-trained or self-trained +vision-language models for alignment, which are prone to limitations in +localization accuracy or generalization capabilities. In this paper, we propose +CoDet, a novel approach that overcomes the reliance on pre-aligned +vision-language space by reformulating region-word alignment as a co-occurring +object discovery problem. Intuitively, by grouping images that mention a shared +concept in their captions, objects corresponding to the shared concept shall +exhibit high co-occurrence among the group. CoDet then leverages visual +similarities to discover the co-occurring objects and align them with the +shared concept. 
Extensive experiments demonstrate that CoDet has superior +performances and compelling scalability in open-vocabulary detection, e.g., by +scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and +44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 +$\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at +https://github.com/CVMI-Lab/CoDet.",cs.CV,['cs.CV'] +Blind Image Quality Assessment Based on Geometric Order Learning,Nyeong-Ho Shin · Seon-Ho Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.14949,,2404.14949.pdf,Multi-Modal Prompt Learning on Blind Image Quality Assessment,"Image Quality Assessment (IQA) models benefit significantly from semantic +information, which allows them to treat different types of objects distinctly. +Currently, leveraging semantic information to enhance IQA is a crucial research +direction. Traditional methods, hindered by a lack of sufficiently annotated +data, have employed the CLIP image-text pretraining model as their backbone to +gain semantic awareness. However, the generalist nature of these pre-trained +Vision-Language (VL) models often renders them suboptimal for IQA-specific +tasks. Recent approaches have attempted to address this mismatch using prompt +technology, but these solutions have shortcomings. Existing prompt-based VL +models overly focus on incremental semantic information from text, neglecting +the rich insights available from visual data analysis. This imbalance limits +their performance improvements in IQA tasks. This paper introduces an +innovative multi-modal prompt-based methodology for IQA. Our approach employs +carefully crafted prompts that synergistically mine incremental semantic +information from both visual and linguistic data. Specifically, in the visual +branch, we introduce a multi-layer prompt structure to enhance the VL model's +adaptability. In the text branch, we deploy a dual-prompt scheme that steers +the model to recognize and differentiate between scene category and distortion +type, thereby refining the model's capacity to assess image quality. Our +experimental findings underscore the effectiveness of our method over existing +Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates +competitive performance across various datasets. Our method achieves Spearman +Rank Correlation Coefficient (SRCC) values of 0.961(surpassing 0.946 in CSIQ) +and 0.941 (exceeding 0.930 in KADID), illustrating its robustness and accuracy +in diverse contexts.",cs.CV,['cs.CV'] +MemoNav: Working Memory Model for Visual Navigation,Hongxin Li · Zeyu Wang · Xu Yang · yuran Yang · Shuqi Mei · Zhaoxiang Zhang,https://github.com/ZJULiHongxin/MemoNav,https://arxiv.org/abs/2402.19161v1,,2402.19161v1.pdf,MemoNav: Working Memory Model for Visual Navigation,"Image-goal navigation is a challenging task that requires an agent to +navigate to a goal indicated by an image in unfamiliar environments. Existing +methods utilizing diverse scene memories suffer from inefficient exploration +since they use all historical observations for decision-making without +considering the goal-relevant fraction. To address this limitation, we present +MemoNav, a novel memory model for image-goal navigation, which utilizes a +working memory-inspired pipeline to improve navigation performance. +Specifically, we employ three types of navigation memory. The node features on +a map are stored in the short-term memory (STM), as these features are +dynamically updated. 
A forgetting module then retains the informative STM +fraction to increase efficiency. We also introduce long-term memory (LTM) to +learn global scene representations by progressively aggregating STM features. +Subsequently, a graph attention module encodes the retained STM and the LTM to +generate working memory (WM) which contains the scene features essential for +efficient navigation. The synergy among these three memory types boosts +navigation performance by enabling the agent to learn and leverage +goal-relevant scene features within a topological map. Our evaluation on +multi-goal tasks demonstrates that MemoNav significantly outperforms previous +methods across all difficulty levels in both Gibson and Matterport3D scenes. +Qualitative results further illustrate that MemoNav plans more efficient +routes.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts,Fei Ni · Jianye Hao · Shiguang Wu · Longxin Kou · Jiashun Liu · YAN ZHENG · Bin Wang · Yuzheng Zhuang, ,,https://pub.towardsai.net/ai-robotics-breakthroughs-and-trends-at-cvpr-2024-d4a83b5f9564,,,,,nan +RCBEVDet: Radar-camera Fusion in Bird’s Eye View for 3D Object Detection,Zhiwei Lin · Zhe Liu · Zhongyu Xia · Xinhao Wang · Yongtao Wang · Shengxiang Qi · Yang Dong · Nan Dong · Le Zhang · Ce Zhu, ,https://arxiv.org/abs/2403.16440,,2403.16440.pdf,RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection,"Three-dimensional object detection is one of the key tasks in autonomous +driving. To reduce costs in practice, low-cost multi-view cameras for 3D object +detection are proposed to replace the expansive LiDAR sensors. However, relying +solely on cameras is difficult to achieve highly accurate and robust 3D object +detection. An effective solution to this issue is combining multi-view cameras +with the economical millimeter-wave radar sensor to achieve more reliable +multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a +radar-camera fusion 3D object detection method in the bird's eye view (BEV). +Specifically, we first design RadarBEVNet for radar BEV feature extraction. +RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section +(RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based +encoder and a transformer-based encoder are proposed to extract radar features, +with an injection and extraction module to facilitate communication between the +two encoders. The RCS-aware BEV encoder takes RCS as the object size prior to +scattering the point feature in BEV. Besides, we present the Cross-Attention +Multi-layer Fusion module to automatically align the multi-modal BEV feature +from radar and camera with the deformable attention mechanism, and then fuse +the feature with channel and spatial fusion layers. Experimental results show +that RCBEVDet achieves new state-of-the-art radar-camera fusion results on +nuScenes and view-of-delft (VoD) 3D object detection benchmarks. Furthermore, +RCBEVDet achieves better 3D detection results than all real-time camera-only +and radar-camera 3D object detectors with a faster inference speed at 21~28 +FPS. 
The source code will be released at https://github.com/VDIGPKU/RCBEVDet.",cs.CV,['cs.CV'] +Instance-based Max-margin for Practical Few-shot Recognition,Minghao Fu · Ke Zhu,https://github.com/heekhero/IbM2,https://arxiv.org/abs/2312.07856,,2312.07856.pdf,DTL: Disentangled Transfer Learning for Visual Recognition,"When pre-trained models become rapidly larger, the cost of fine-tuning on +downstream tasks steadily increases, too. To economically fine-tune these +models, parameter-efficient transfer learning (PETL) is proposed, which only +tunes a tiny subset of trainable parameters to efficiently learn quality +representations. However, current PETL methods are facing the dilemma that +during training the GPU memory footprint is not effectively reduced as +trainable parameters. PETL will likely fail, too, if the full fine-tuning +encounters the out-of-GPU-memory issue. This phenomenon happens because +trainable parameters from these methods are generally entangled with the +backbone, such that a lot of intermediate states have to be stored in GPU +memory for gradient propagation. To alleviate this problem, we introduce +Disentangled Transfer Learning (DTL), which disentangles the trainable +parameters from the backbone using a lightweight Compact Side Network (CSN). By +progressively extracting task-specific information with a few low-rank linear +mappings and appropriately adding the information back to the backbone, CSN +effectively realizes knowledge transfer in various downstream tasks. We +conducted extensive experiments to validate the effectiveness of our method. +The proposed method not only reduces a large amount of GPU memory usage and +trainable parameters, but also outperforms existing PETL methods by a +significant margin in accuracy, achieving new state-of-the-art on several +standard benchmarks. The code is available at https://github.com/heekhero/DTL.",cs.CV,"['cs.CV', 'cs.AI']" +TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding,Shuhuai Ren · Linli Yao · Shicheng Li · Xu Sun · Lu Hou,https://github.com/RenShuhuai-Andy/TimeChat,https://arxiv.org/abs/2312.02051,,2312.02051.pdf,TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding,"This work proposes TimeChat, a time-sensitive multimodal large language model +specifically designed for long video understanding. Our model incorporates two +key architectural contributions: (1) a timestamp-aware frame encoder that binds +visual content with the timestamp of each frame, and (2) a sliding video +Q-Former that produces a video token sequence of varying lengths to accommodate +videos of various durations. Additionally, we construct an instruction-tuning +dataset, encompassing 6 tasks and a total of 125K instances, to further enhance +TimeChat's instruction-following performance. Experiment results across various +video understanding tasks, such as dense captioning, temporal grounding, and +highlight detection, demonstrate TimeChat's strong zero-shot temporal +localization and reasoning capabilities. 
For example, it achieves +9.2 F1 score +and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) +on Charades-STA, compared to state-of-the-art video large language models, +holding the potential to serve as a versatile video assistant for long-form +video comprehension tasks and satisfy realistic user requirements.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis,Yufei Ye · Abhinav Gupta · Kris Kitani · Shubham Tulsiani, ,https://arxiv.org/abs/2404.12383,,2404.12383.pdf,G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis,"We propose G-HOP, a denoising diffusion based generative prior for +hand-object interactions that allows modeling both the 3D object and a human +hand, conditioned on the object category. To learn a 3D spatial diffusion model +that can capture this joint distribution, we represent the human hand via a +skeletal distance field to obtain a representation aligned with the (latent) +signed distance field for the object. We show that this hand-object prior can +then serve as generic guidance to facilitate other tasks like reconstruction +from interaction clip and human grasp synthesis. We believe that our model, +trained by aggregating seven diverse real-world interaction datasets spanning +across 155 categories, represents a first approach that allows jointly +generating both hand and object. Our empirical evaluations demonstrate the +benefit of this joint prior in video-based reconstruction and human grasp +synthesis, outperforming current task-specific baselines. + Project website: https://judyye.github.io/ghop-www",cs.CV,['cs.CV'] +VTimeLLM: Empower LLM to Grasp Video Moments,Bin Huang · Xin Wang · Hong Chen · Zihan Song · Wenwu Zhu, ,https://arxiv.org/abs/2311.18445v1,,2311.18445v1.pdf,VTimeLLM: Empower LLM to Grasp Video Moments,"Large language models (LLMs) have shown remarkable text understanding +capabilities, which have been extended as Video LLMs to handle video data for +comprehending visual details. However, existing Video LLMs can only provide a +coarse description of the entire video, failing to capture the precise start +and end time boundary of specific events. In this paper, we solve this issue +via proposing VTimeLLM, a novel Video LLM designed for fine-grained video +moment understanding and reasoning with respect to time boundary. Specifically, +our VTimeLLM adopts a boundary-aware three-stage training strategy, which +respectively utilizes image-text pairs for feature alignment, multiple-event +videos to increase temporal-boundary awareness, and high-quality +video-instruction tuning to further improve temporal understanding ability as +well as align with human intents. Extensive experiments demonstrate that in +fine-grained time-related comprehension tasks for videos such as Temporal Video +Grounding and Dense Video Captioning, VTimeLLM significantly outperforms +existing Video LLMs. 
Besides, benefits from the fine-grained temporal +understanding of the videos further enable VTimeLLM to beat existing Video LLMs +in video dialogue benchmark, showing its superior cross-modal understanding and +reasoning abilities.",cs.CV,['cs.CV'] +DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction,Jaehyeok Shim · Kyungdon Joo, ,https://arxiv.org/abs/2403.05005,,2403.05005.pdf,DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction,"We propose a novel concept of dual and integrated latent topologies (DITTO in +short) for implicit 3D reconstruction from noisy and sparse point clouds. Most +existing methods predominantly focus on single latent type, such as point or +grid latents. In contrast, the proposed DITTO leverages both point and grid +latents (i.e., dual latent) to enhance their strengths, the stability of grid +latents and the detail-rich capability of point latents. Concretely, DITTO +consists of dual latent encoder and integrated implicit decoder. In the dual +latent encoder, a dual latent layer, which is the key module block composing +the encoder, refines both latents in parallel, maintaining their distinct +shapes and enabling recursive interaction. Notably, a newly proposed dynamic +sparse point transformer within the dual latent layer effectively refines point +latents. Then, the integrated implicit decoder systematically combines these +refined latents, achieving high-fidelity 3D reconstruction and surpassing +previous state-of-the-art methods on object- and scene-level datasets, +especially in thin and detailed structures.",cs.CV,['cs.CV'] +Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation,Lior Talker · Aviad Cohen · Erez Yosef · Alexandra Dana · Michael Dinerstein,https://github.com/liortalker/MindTheEdge,,https://www.youtube.com/watch?v=WPmbAnJk3rE,,,,,nan +StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN,Jongwoo Choi · Kwanggyoon Seo · Amirsaman Ashtari · Junyong Noh,https://jeolpyeoni.github.io/stylecinegan_project/,https://arxiv.org/abs/2403.14186,,2403.14186.pdf,StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN,"We propose a method that can generate cinemagraphs automatically from a still +landscape image using a pre-trained StyleGAN. Inspired by the success of recent +unconditional video generation, we leverage a powerful pre-trained image +generator to synthesize high-quality cinemagraphs. Unlike previous approaches +that mainly utilize the latent space of a pre-trained StyleGAN, our approach +utilizes its deep feature space for both GAN inversion and cinemagraph +generation. Specifically, we propose multi-scale deep feature warping (MSDFW), +which warps the intermediate features of a pre-trained StyleGAN at different +resolutions. By using MSDFW, the generated cinemagraphs are of high resolution +and exhibit plausible looping animation. 
We demonstrate the superiority of our +method through user studies and quantitative comparisons with state-of-the-art +cinemagraph generation methods and a video generation method that uses a +pre-trained StyleGAN.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Decoupled Pseudo-labeling in Semi-Supervised Monocular 3D Object Detection,Jiacheng Zhang · Jiaming Li · Xiangru Lin · Wei Zhang · Xiao Tan · Junyu Han · Errui Ding · Jingdong Wang · Guanbin Li, ,https://arxiv.org/abs/2403.17387,,2403.17387.pdf,Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection,"We delve into pseudo-labeling for semi-supervised monocular 3D object +detection (SSM3OD) and discover two primary issues: a misalignment between the +prediction quality of 3D and 2D attributes and the tendency of depth +supervision derived from pseudo-labels to be noisy, leading to significant +optimization conflicts with other reliable forms of supervision. We introduce a +novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach +features a Decoupled Pseudo-label Generation (DPG) module, designed to +efficiently generate pseudo-labels by separately processing 2D and 3D +attributes. This module incorporates a unique homography-based method for +identifying dependable pseudo-labels in BEV space, specifically for 3D +attributes. Additionally, we present a DepthGradient Projection (DGP) module to +mitigate optimization conflicts caused by noisy depth supervision of +pseudo-labels, effectively decoupling the depth gradient and removing +conflicting gradients. This dual decoupling strategy-at both the pseudo-label +generation and gradient levels-significantly improves the utilization of +pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark +demonstrate the superiority of our method over existing approaches.",cs.CV,['cs.CV'] +T-VSL: Text-Guided Visual Sound Source Localization in Mixtures,Tanvir Mahmud · Yapeng Tian · Diana Marculescu, ,https://arxiv.org/abs/2404.01751v1,,2404.01751v1.pdf,T-VSL: Text-Guided Visual Sound Source Localization in Mixtures,"Visual sound source localization poses a significant challenge in identifying +the semantic region of each sounding source within a video. Existing +self-supervised and weakly supervised source localization methods struggle to +accurately distinguish the semantic regions of each sounding object, +particularly in multi-source mixtures. These methods often rely on audio-visual +correspondence as guidance, which can lead to substantial performance drops in +complex multi-source localization scenarios. The lack of access to individual +source sounds in multi-source mixtures during training exacerbates the +difficulty of learning effective audio-visual correspondence for localization. +To address this limitation, in this paper, we propose incorporating the text +modality as an intermediate feature guide using tri-modal joint embedding +models (e.g., AudioCLIP) to disentangle the semantic audio-visual source +correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by +predicting the class of sounding entities in mixtures. Subsequently, the +textual representation of each sounding source is employed as guidance to +disentangle fine-grained audio-visual source correspondence from multi-source +mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables +our framework to handle a flexible number of sources and exhibits promising +zero-shot transferability to unseen classes during test time. 
Extensive +experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets +demonstrate significant performance improvements over state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation,Xiaoqi Wang · Wenbin He · Xiwei Xuan · Clint Sebastian · Jorge Piazentin Ono · Xin Li · Sima Behpour · Thang Doan · Liang Gou · Shen · Liu Ren, ,http://export.arxiv.org/abs/2307.00764,,2307.00764.pdf,Hierarchical Open-vocabulary Universal Image Segmentation,"Open-vocabulary image segmentation aims to partition an image into semantic +regions according to arbitrary text descriptions. However, complex visual +scenes can be naturally decomposed into simpler parts and abstracted at +multiple levels of granularity, introducing inherent segmentation ambiguity. +Unlike existing methods that typically sidestep this ambiguity and treat it as +an external factor, our approach actively incorporates a hierarchical +representation encompassing different semantic-levels into the learning +process. We propose a decoupled text-image fusion mechanism and representation +learning modules for both ""things"" and ""stuff"". Additionally, we systematically +examine the differences that exist in the textual and visual features between +these types of categories. Our resulting model, named HIPIE, tackles +HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a +unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, +Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the +state-of-the-art results at various levels of image comprehension, including +semantic-level (e.g., semantic segmentation), instance-level (e.g., +panoptic/referring segmentation and object detection), as well as part-level +(e.g., part/subpart segmentation) tasks. Our code is released at +https://github.com/berkeley-hipie/HIPIE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses,Chen Zhao · Tong Zhang · Zheng Dang · Mathieu Salzmann, ,https://arxiv.org/abs/2403.13683,,2403.13683.pdf,DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses,"Determining the relative pose of an object between two images is pivotal to +the success of generalizable object pose estimation. Existing approaches +typically approximate the continuous pose representation with a large number of +discrete pose hypotheses, which incurs a computationally expensive process of +scoring each hypothesis at test time. By contrast, we present a Deep Voxel +Matching Network (DVMNet) that eliminates the need for pose hypotheses and +computes the relative object pose in a single pass. To this end, we map the two +input RGB images, reference and query, to their respective voxelized 3D +representations. We then pass the resulting voxels through a pose estimation +module, where the voxels are aligned and the pose is computed in an end-to-end +fashion by solving a least-squares problem. To enhance robustness, we introduce +a weighted closest voxel algorithm capable of mitigating the impact of noisy +voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse +datasets, demonstrating that our method delivers more accurate relative pose +estimates for novel objects at a lower computational cost compared to +state-of-the-art methods. 
Our code is released at: +https://github.com/sailor-z/DVMNet/.",cs.CV,"['cs.CV', 'cs.RO']" +pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction,David Charatan · Sizhe Lester Li · Andrea Tagliasacchi · Vincent Sitzmann, ,https://arxiv.org/abs/2312.12337,,2312.12337.pdf,pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction,"We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D +radiance fields parameterized by 3D Gaussian primitives from pairs of images. +Our model features real-time and memory-efficient rendering for scalable +training as well as fast 3D reconstruction at inference time. To overcome local +minima inherent to sparse and locally supported representations, we predict a +dense probability distribution over 3D and sample Gaussian means from that +probability distribution. We make this sampling operation differentiable via a +reparameterization trick, allowing us to back-propagate gradients through the +Gaussian splatting representation. We benchmark our method on wide-baseline +novel view synthesis on the real-world RealEstate10k and ACID datasets, where +we outperform state-of-the-art light field transformers and accelerate +rendering by 2.5 orders of magnitude while reconstructing an interpretable and +editable 3D radiance field.",cs.CV,"['cs.CV', 'cs.LG']" +MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation,Yanhui Wang · Jianmin Bao · Wenming Weng · Ruoyu Feng · Dacheng Yin · Tao Yang · Jingxu Zhang · Qi Dai · Zhiyuan Zhao · Chunyu Wang · Kai Qiu · Yuhui Yuan · Xiaoyan Sun · Chong Luo · Baining Guo, ,https://arxiv.org/abs/2311.18829,,2311.18829.pdf,MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation,"We present MicroCinema, a straightforward yet effective framework for +high-quality and coherent text-to-video generation. Unlike existing approaches +that align text prompts with video directly, MicroCinema introduces a +Divide-and-Conquer strategy which divides the text-to-video into a two-stage +process: text-to-image generation and image\&text-to-video generation. This +strategy offers two significant advantages. a) It allows us to take full +advantage of the recent advances in text-to-image models, such as Stable +Diffusion, Midjourney, and DALLE, to generate photorealistic and highly +detailed images. b) Leveraging the generated image, the model can allocate less +focus to fine-grained appearance details, prioritizing the efficient learning +of motion dynamics. To implement this strategy effectively, we introduce two +core designs. First, we propose the Appearance Injection Network, enhancing the +preservation of the appearance of the given image. Second, we introduce the +Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities +of pre-trained 2D diffusion models. These design elements empower MicroCinema +to generate high-quality videos with precise motion, guided by the provided +text prompts. Extensive experiments demonstrate the superiority of the proposed +framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on +UCF-101 and 377.40 on MSR-VTT. 
See +https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.",cs.CV,['cs.CV'] +Domain Prompt Learning with Quaternion Networks,Qinglong Cao · Zhengqin Xu · Yuntian Chen · Chao Ma · Xiaokang Yang, ,https://arxiv.org/abs/2312.08878,,2312.08878.pdf,Domain Prompt Learning with Quaternion Networks,"Prompt learning has emerged as an effective and data-efficient technique in +large Vision-Language Models (VLMs). However, when adapting VLMs to specialized +domains such as remote sensing and medical imaging, domain prompt learning +remains underexplored. While large-scale domain-specific foundation models can +help tackle this challenge, their concentration on a single vision level makes +it challenging to prompt both vision and language modalities. To overcome this, +we propose to leverage domain-specific knowledge from domain-specific +foundation models to transfer the robust recognition ability of VLMs from +generalized to specialized domains, using quaternion networks. Specifically, +the proposed method involves using domain-specific vision features from +domain-specific foundation models to guide the transformation of generalized +contextual embeddings from the language branch into a specialized space within +the quaternion networks. Moreover, we present a hierarchical approach that +generates vision prompt features by analyzing intermodal relationships between +hierarchical language prompt features and domain-specific vision features. In +this way, quaternion networks can effectively mine the intermodal relationships +in the specific domain, facilitating domain-specific vision-language +contrastive learning. Extensive experiments on domain-specific datasets show +that our proposed method achieves new state-of-the-art results in prompt +learning.",cs.CV,"['cs.CV', 'cs.LG', 'stat.AP']" +MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints,Pengfei Xie · Wenqiang Xu · Tutian Tang · Zhenjun Yu · Cewu Lu, ,https://arxiv.org/abs/2404.10227,,2404.10227.pdf,MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints,"This work proposes a novel learning framework for visual hand dynamics +analysis that takes into account the physiological aspects of hand motion. The +existing models, which are simplified joint-actuated systems, often produce +unnatural motions. To address this, we integrate a musculoskeletal system with +a learnable parametric hand model, MANO, to create a new model, MS-MANO. This +model emulates the dynamics of muscles and tendons to drive the skeletal +system, imposing physiologically realistic constraints on the resulting torque +trajectories. We further propose a simulation-in-the-loop pose refinement +framework, BioPR, that refines the initial estimated pose through a multi-layer +perceptron (MLP) network. Our evaluation of the accuracy of MS-MANO and the +efficacy of the BioPR is conducted in two separate parts. The accuracy of +MS-MANO is compared with MyoSuite, while the efficacy of BioPR is benchmarked +against two large-scale public datasets and two recent state-of-the-art +methods. 
The results demonstrate that our approach consistently improves the +baseline methods both quantitatively and qualitatively.",cs.CV,"['cs.CV', 'cs.RO']" +JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients,Woo Kyoung Han · Sunghoon Im · Jaedeok Kim · Kyong Hwan Jin,https://wookyounghan.github.io/JDEC/,https://arxiv.org/abs/2404.05558,,2404.05558.pdf,JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients,"We propose a practical approach to JPEG image decoding, utilizing a local +implicit neural representation with continuous cosine formulation. The JPEG +algorithm significantly quantizes discrete cosine transform (DCT) spectra to +achieve a high compression rate, inevitably resulting in quality degradation +while encoding an image. We have designed a continuous cosine spectrum +estimator to address the quality degradation issue that restores the distorted +spectrum. By leveraging local DCT formulations, our network has the privilege +to exploit dequantization and upsampling simultaneously. Our proposed model +enables decoding compressed images directly across different quality factors +using a single pre-trained model without relying on a conventional JPEG +decoder. As a result, our proposed network achieves state-of-the-art +performance in flexible color image JPEG artifact removal tasks. Our source +code is available at https://github.com/WooKyoungHan/JDEC.",eess.IV,"['eess.IV', 'cs.CV']" +Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation,Sihan liu · Yiwei Ma · Xiaoqing Zhang · Haowei Wang · Jiayi Ji · Xiaoshuai Sun · Rongrong Ji, ,https://arxiv.org/abs/2312.12470,,2312.12470.pdf,Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation,"Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that +combines computer vision and natural language processing, delineating specific +regions in aerial images as described by textual queries. Traditional Referring +Image Segmentation (RIS) approaches have been impeded by the complex spatial +scales and orientations found in aerial imagery, leading to suboptimal +segmentation results. To address these challenges, we introduce the Rotated +Multi-Scale Interaction Network (RMSIN), an innovative approach designed for +the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction +Module (IIM) to effectively address the fine-grained detail required at +multiple scales and a Cross-scale Interaction Module (CIM) for integrating +these details coherently across the network. Furthermore, RMSIN employs an +Adaptive Rotated Convolution (ARC) to account for the diverse orientations of +objects, a novel contribution that significantly enhances segmentation +accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset +comprising 17,402 image-caption-mask triplets, which is unparalleled in terms +of scale and variety. This dataset not only presents the model with a wide +range of spatial and rotational scenarios but also establishes a stringent +benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. +Our experimental evaluations demonstrate the exceptional performance of RMSIN, +surpassing existing state-of-the-art models by a significant margin. 
All +datasets and code are made available at https://github.com/Lsan2401/RMSIN.",cs.CV,['cs.CV'] +"IDGuard: Robust, General, Identity-centric POI Proactive Defense Against Face Editing Abuse",Yunshu Dai · Jianwei Fei · Fangjun Huang, ,https://arxiv.org/abs/2311.01357,,2311.01357.pdf,Robust Identity Perceptual Watermark Against Deepfake Face Swapping,"Notwithstanding offering convenience and entertainment to society, Deepfake +face swapping has caused critical privacy issues with the rapid development of +deep generative models. Due to imperceptible artifacts in high-quality +synthetic images, passive detection models against face swapping in recent +years usually suffer performance damping regarding the generalizability issue. +Therefore, several studies have been attempted to proactively protect the +original images against malicious manipulations by inserting invisible signals +in advance. However, the existing proactive defense approaches demonstrate +unsatisfactory results with respect to visual quality, detection accuracy, and +source tracing ability. In this study, to fulfill the research gap, we propose +the first robust identity perceptual watermarking framework that concurrently +performs detection and source tracing against Deepfake face swapping +proactively. We assign identity semantics regarding the image contents to the +watermarks and devise an unpredictable and nonreversible chaotic encryption +system to ensure watermark confidentiality. The watermarks are encoded and +recovered by jointly training an encoder-decoder framework along with +adversarial image manipulations. Falsification and source tracing are +accomplished by justifying the consistency between the content-matched identity +perceptual watermark and the recovered robust watermark from the image. +Extensive experiments demonstrate state-of-the-art detection performance on +Deepfake face swapping under both cross-dataset and cross-manipulation +settings.",cs.CV,['cs.CV'] +Tri-Perspective View Decomposition for Geometry-Aware Depth Completion,Zhiqiang Yan · Yuankai Lin · Kun Wang · Yupeng Zheng · Yufei Wang · Zhenyu Zhang · Jun Li · Jian Yang, ,https://arxiv.org/abs/2403.15008,,2403.15008.pdf,Tri-Perspective View Decomposition for Geometry-Aware Depth Completion,"Depth completion is a vital task for autonomous driving, as it involves +reconstructing the precise 3D geometry of a scene from sparse and noisy depth +measurements. However, most existing methods either rely only on 2D depth +representations or directly incorporate raw 3D point clouds for compensation, +which are still insufficient to capture the fine-grained 3D geometry of the +scene. To address this challenge, we introduce Tri-Perspective view +Decomposition (TPVD), a novel framework that can explicitly model 3D geometry. +In particular, (1) TPVD ingeniously decomposes the original point cloud into +three 2D views, one of which corresponds to the sparse depth input. (2) We +design TPV Fusion to update the 2D TPV features through recurrent 2D-3D-2D +aggregation, where a Distance-Aware Spherical Convolution (DASC) is applied. +(3) By adaptively choosing TPV affinitive neighbors, the newly proposed +Geometric Spatial Propagation Network (GSPN) further improves the geometric +consistency. As a result, our TPVD outperforms existing methods on KITTI, +NYUv2, and SUN RGBD. Furthermore, we build a novel depth completion dataset +named TOFDC, which is acquired by the time-of-flight (TOF) sensor and the color +camera on smartphones. 
Project page: +https://yanzq95.github.io/projectpage/TOFDC/index.html",cs.CV,['cs.CV'] +Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing,Bingyan Liu · Chengyu Wang · Tingfeng Cao · Kui Jia · Jun Huang, ,https://arxiv.org/abs/2403.03431,,2403.03431.pdf,Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing,"Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have +recently gained significant popularity for creative Text-to-image generation. +Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) +is of greater importance for application developers, which modify objects or +object properties in images by manipulating feature components in attention +layers during the generation process. However, little is known about what +semantic meanings these attention layers have learned and which parts of the +attention maps contribute to the success of image editing. In this paper, we +conduct an in-depth probing analysis and demonstrate that cross-attention maps +in Stable Diffusion often contain object attribution information that can +result in editing failures. In contrast, self-attention maps play a crucial +role in preserving the geometric and shape details of the source image during +the transformation to the target image. Our analysis offers valuable insights +into understanding cross and self-attention maps in diffusion models. Moreover, +based on our findings, we simplify popular image editing methods and propose a +more straightforward yet more stable and efficient tuning-free procedure that +only modifies self-attention maps of the specified attention layers during the +denoising process. Experimental results show that our simplified method +consistently surpasses the performance of popular approaches on multiple +datasets.",cs.CV,['cs.CV'] +RoHM: Robust Human Motion Reconstruction via Diffusion,Siwei Zhang · Bharat Lal Bhatnagar · Yuanlu Xu · Alexander Winkler · Petr Kadlecek · Siyu Tang · Federica Bogo,https://sanweiliti.github.io/ROHM/ROHM.html,https://arxiv.org/abs/2401.08570,,2401.08570.pdf,RoHM: Robust Human Motion Reconstruction via Diffusion,"We propose RoHM, an approach for robust 3D human motion reconstruction from +monocular RGB(-D) videos in the presence of noise and occlusions. Most previous +approaches either train neural networks to directly regress motion in 3D or +learn data-driven motion priors and combine them with optimization at test +time. The former do not recover globally coherent motion and fail under +occlusions; the latter are time-consuming, prone to local minima, and require +manual tuning. To overcome these shortcomings, we exploit the iterative, +denoising nature of diffusion models. RoHM is a novel diffusion-based motion +model that, conditioned on noisy and occluded input data, reconstructs +complete, plausible motions in consistent global coordinates. Given the +complexity of the problem -- requiring one to address different tasks +(denoising and infilling) in different solution spaces (local and global +motion) -- we decompose it into two sub-tasks and learn two models, one for +global trajectory and one for local motion. To capture the correlations between +the two, we then introduce a novel conditioning module, combining it with an +iterative inference scheme. We apply RoHM to a variety of tasks -- from motion +reconstruction and denoising to spatial and temporal infilling. 
Extensive +experiments on three popular datasets show that our method outperforms +state-of-the-art approaches qualitatively and quantitatively, while being +faster at test time. The code is available at +https://sanweiliti.github.io/ROHM/ROHM.html.",cs.CV,['cs.CV'] +Abductive Ego-View Accident Video Understanding for Safe Driving Perception,Jianwu Fang · Lei-lei Li · Junfei Zhou · Junbin Xiao · Hongkai Yu · Chen Lv · Jianru Xue · Tat-seng Chua,www.lotvsmmau.net,https://arxiv.org/abs/2403.00436,,2403.00436.pdf,Abductive Ego-View Accident Video Understanding for Safe Driving Perception,"We present MM-AU, a novel dataset for Multi-Modal Accident video +Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each +with temporally aligned text descriptions. We annotate over 2.23 million object +boxes and 58,650 pairs of video-based accident reasons, covering 58 accident +categories. MM-AU supports various accident understanding tasks, particularly +multimodal video diffusion to understand accident cause-effect chains for safe +driving. With MM-AU, we present an Abductive accident Video understanding +framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video +diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven +by an abductive CLIP model. This model involves a contrastive interaction loss +to learn the pair co-occurrence of normal, near-accident, accident frames with +the corresponding text descriptions, such as accident reasons, prevention +advice, and accident categories. OAVD enforces the causal region learning while +fixing the content of the original frame background in video generation, to +find the dominant cause-effect chain for certain accidents. Extensive +experiments verify the abductive ability of AdVersa-SD and the superiority of +OAVD against the state-of-the-art diffusion models. Additionally, we provide +careful benchmark evaluations for object detection and accident reason +answering since AdVersa-SD relies on precise object and accident reason +information.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Language-Driven Video Inpainting via Multimodal Large Language Models,Jianzong Wu · Xiangtai Li · Chenyang Si · Shangchen Zhou · Jingkang Yang · Jiangning Zhang · Yining Li · Kai Chen · Yunhai Tong · Ziwei Liu · Chen Change Loy,https://jianzongwu.github.io/projects/rovi/,https://arxiv.org/abs/2401.10226,,2401.10226.pdf,Towards Language-Driven Video Inpainting via Multimodal Large Language Models,"We introduce a new task -- language-driven video inpainting, which uses +natural language instructions to guide the inpainting process. This approach +overcomes the limitations of traditional video inpainting methods that depend +on manually labeled binary masks, a process often tedious and labor-intensive. +We present the Remove Objects from Videos by Instructions (ROVI) dataset, +containing 5,650 videos and 9,091 inpainting results, to support training and +evaluation for this task. We also propose a novel diffusion-based +language-driven video inpainting framework, the first end-to-end baseline for +this task, integrating Multimodal Large Language Models to understand and +execute complex language-based inpainting requests effectively. Our +comprehensive results showcase the dataset's versatility and the model's +effectiveness in various language-instructed inpainting scenarios. 
We will make +datasets, code, and models publicly available.",cs.CV,['cs.CV'] +Self-Supervised Facial Representation Learning with Facial Region Awareness,Zheng Gao · Ioannis Patras, ,https://arxiv.org/abs/2403.02138,,2403.02138.pdf,Self-Supervised Facial Representation Learning with Facial Region Awareness,"Self-supervised pre-training has been proved to be effective in learning +transferable representations that benefit various visual tasks. This paper asks +this question: can self-supervised pre-training learn general facial +representations for various facial analysis tasks? Recent efforts toward this +goal are limited to treating each face image as a whole, i.e., learning +consistent facial representations at the image-level, which overlooks the +consistency of local facial representations (i.e., facial regions like eyes, +nose, etc). In this work, we make a first attempt to propose a novel +self-supervised facial representation learning framework to learn consistent +global and local facial representations, Facial Region Awareness (FRA). +Specifically, we explicitly enforce the consistency of facial regions by +matching the local facial representations across views, which are extracted +with learned heatmaps highlighting the facial regions. Inspired by the mask +prediction in supervised semantic segmentation, we obtain the heatmaps via +cosine similarity between the per-pixel projection of feature maps and facial +mask embeddings computed from learnable positional embeddings, which leverage +the attention mechanism to globally look up the facial image for facial +regions. To learn such heatmaps, we formulate the learning of facial mask +embeddings as a deep clustering problem by assigning the pixel features from +the feature maps to them. The transfer learning results on facial +classification and regression tasks show that our FRA outperforms previous +pre-trained models and more importantly, using ResNet as the unified backbone +for various tasks, our FRA achieves comparable or even better performance +compared with SOTA methods in facial analysis tasks.",cs.CV,['cs.CV'] +Visual Anagrams: Synthesizing Multi-View Optical Illusions with Diffusion Models,Daniel Geng · Inbum Park · Andrew Owens, ,https://arxiv.org/abs/2311.17919,,2311.17919.pdf,Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models,"We address the problem of synthesizing multi-view optical illusions: images +that change appearance upon a transformation, such as a flip or rotation. We +propose a simple, zero-shot method for obtaining these illusions from +off-the-shelf text-to-image diffusion models. During the reverse diffusion +process, we estimate the noise from different views of a noisy image, and then +combine these noise estimates together and denoise the image. A theoretical +analysis suggests that this method works precisely for views that can be +written as orthogonal transformations, of which permutations are a subset. This +leads to the idea of a visual anagram--an image that changes appearance under +some rearrangement of pixels. This includes rotations and flips, but also more +exotic pixel permutations such as a jigsaw rearrangement. Our approach also +naturally extends to illusions with more than two views. We provide both +qualitative and quantitative results demonstrating the effectiveness and +flexibility of our method. 
Please see our project webpage for additional +visualizations and results: https://dangeng.github.io/visual_anagrams/",cs.CV,['cs.CV'] +AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,Mingfu Liang · Jong-Chyi Su · Samuel Schulter · Sparsh Garg · Shiyu Zhao · Ying Wu · Manmohan Chandraker, ,https://arxiv.org/abs/2403.17373,,2403.17373.pdf,AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,"Autonomous vehicle (AV) systems rely on robust perception models as a +cornerstone of safety assurance. However, objects encountered on the road +exhibit a long-tailed distribution, with rare or unseen categories posing +challenges to a deployed perception model. This necessitates an expensive +process of continuously curating and annotating data with significant human +effort. We propose to leverage recent advances in vision-language and large +language models to design an Automatic Data Engine (AIDE) that automatically +identifies issues, efficiently curates data, improves the model through +auto-labeling, and verifies the model through generation of diverse scenarios. +This process operates iteratively, allowing for continuous self-improvement of +the model. We further establish a benchmark for open-world detection on AV +datasets to comprehensively evaluate various learning paradigms, demonstrating +our method's superior performance at a reduced cost.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Hierarchical Intra-modal Correlation Learning for Label-free 3D Semantic Segmentation,Xin Kang · Lei Chu · Jiahao Li · Xuejin Chen · Yan Lu, ,https://arxiv.org/abs/2309.10649,,2309.10649.pdf,Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation,"Current state-of-the-art point cloud-based perception methods usually rely on +large-scale labeled data, which requires expensive manual annotations. A +natural option is to explore the unsupervised methodology for 3D perception +tasks. However, such methods often face substantial performance-drop +difficulties. Fortunately, we found that there exist amounts of image-based +datasets and an alternative can be proposed, i.e., transferring the knowledge +in the 2D images to 3D point clouds. Specifically, we propose a novel approach +for the challenging cross-modal and cross-domain adaptation task by fully +exploring the relationship between images and point clouds and designing +effective feature alignment strategies. Without any 3D labels, our method +achieves state-of-the-art performance for 3D point cloud semantic segmentation +on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to +existing unsupervised and weakly-supervised baselines.",cs.CV,['cs.CV'] +Rethinking Multi-domain Generalization with A General Learning Objective,Zhaorui Tan · Xi Yang · Kaizhu Huang, ,https://arxiv.org/abs/2402.18853,,2402.18853.pdf,Rethinking Multi-domain Generalization with A General Learning Objective,"Multi-domain generalization (mDG) is universally aimed to minimize the +discrepancy between training and testing distributions to enhance +marginal-to-label distribution mapping. However, existing mDG literature lacks +a general learning objective paradigm and often imposes constraints on static +target marginal distributions. In this paper, we propose to leverage a +$Y$-mapping to relax the constraint. We rethink the learning objective for mDG +and design a new \textbf{general learning objective} to interpret and analyze +most existing mDG wisdom. 
This general objective is bifurcated into two +synergistic aims: learning domain-independent conditional features and +maximizing a posterior. Explorations also extend to two effective +regularization terms that incorporate prior information and suppress invalid +causality, alleviating the issues that come with relaxed constraints. We +theoretically contribute an upper bound for the domain alignment of +domain-independent conditional features, disclosing that many previous mDG +endeavors actually \textbf{optimize partially the objective} and thus lead to +limited performance. As such, our study distills a general learning objective +into four practical components, providing a general, robust, and flexible +mechanism to handle complex domain shifts. Extensive empirical results indicate +that the proposed objective with $Y$-mapping leads to substantially better mDG +performance in various downstream tasks, including regression, segmentation, +and classification.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +VideoMAC: Video Masked Autoencoders Meet ConvNets,Gensheng Pei · Tao Chen · Xiruo Jiang · 刘华峰 Liu · Zeren Sun · Yazhou Yao, ,https://arxiv.org/abs/2402.19082,,2402.19082.pdf,VideoMAC: Video Masked Autoencoders Meet ConvNets,"Recently, the advancement of self-supervised learning techniques, like masked +autoencoders (MAE), has greatly influenced visual representation learning for +images and videos. Nevertheless, it is worth noting that the predominant +approaches in existing masked image / video modeling rely excessively on +resource-intensive vision transformers (ViTs) as the feature encoder. In this +paper, we propose a new approach termed as \textbf{VideoMAC}, which combines +video masked autoencoders with resource-friendly ConvNets. Specifically, +VideoMAC employs symmetric masking on randomly sampled pairs of video frames. +To prevent the issue of mask pattern dissipation, we utilize ConvNets which are +implemented with sparse convolutional operators as encoders. Simultaneously, we +present a simple yet effective masked video modeling (MVM) approach, a dual +encoder architecture comprising an online encoder and an exponential moving +average target encoder, aimed to facilitate inter-frame reconstruction +consistency in videos. Additionally, we demonstrate that VideoMAC, empowering +classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the +benefits of MVM, outperforms ViT-based approaches on downstream tasks, +including video object segmentation (+\textbf{5.2\%} / \textbf{6.4\%} +$\mathcal{J}\&\mathcal{F}$), body part propagation (+\textbf{6.3\%} / +\textbf{3.1\%} mIoU), and human pose tracking (+\textbf{10.2\%} / +\textbf{11.1\%} PCK@0.1).",cs.CV,['cs.CV'] +EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection,Xuanyu Zhang · Runyi Li · Jiwen Yu · Youmin Xu · Weiqi Li · Jian Zhang,https://xuanyuzhang21.github.io/project/editguard/,https://arxiv.org/abs/2312.08883,,2312.08883.pdf,EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection,"In the era where AI-generated content (AIGC) models can produce stunning and +lifelike images, the lingering shadow of unauthorized reproductions and +malicious tampering poses imminent threats to copyright integrity and +information security. Current image watermarking methods, while widely accepted +for safeguarding visual content, can only protect copyright and ensure +traceability. 
They fall short in localizing increasingly realistic image +tampering, potentially leading to trust crises, privacy violations, and legal +disputes. To solve this challenge, we propose an innovative proactive forensics +framework EditGuard, to unify copyright protection and tamper-agnostic +localization, especially for AIGC-based editing methods. It can offer a +meticulous embedding of imperceptible watermarks and precise decoding of +tampered areas and copyright information. Leveraging our observed fragility and +locality of image-into-image steganography, the realization of EditGuard can be +converted into a united image-bit steganography issue, thus completely +decoupling the training process from the tampering types. Extensive experiments +demonstrate that our EditGuard balances the tamper localization accuracy, +copyright recovery precision, and generalizability to various AIGC-based +tampering methods, especially for image forgery that is difficult for the naked +eye to detect. The project page is available at +https://xuanyuzhang21.github.io/project/editguard/.",cs.CV,['cs.CV'] +Re-thinking Data Availability Attacks Against Deep Neural Networks,Bin Fang · Bo Li · Shuang Wu · Shouhong Ding · Ran Yi · Lizhuang Ma, ,https://arxiv.org/abs/2401.09740,,2401.09740.pdf,Hijacking Attacks against Neural Networks by Analyzing Training Data,"Backdoors and adversarial examples are the two primary threats currently +faced by deep neural networks (DNNs). Both attacks attempt to hijack the model +behaviors with unintended outputs by introducing (small) perturbations to the +inputs. Backdoor attacks, despite the high success rates, often require a +strong assumption, which is not always easy to achieve in reality. Adversarial +example attacks, which put relatively weaker assumptions on attackers, often +demand high computational resources, yet do not always yield satisfactory +success rates when attacking mainstream black-box models in the real world. +These limitations motivate the following research question: can model hijacking +be achieved more simply, with a higher attack success rate and more reasonable +assumptions? In this paper, we propose CleanSheet, a new model hijacking attack +that obtains the high performance of backdoor attacks without requiring the +adversary to tamper with the model training process. CleanSheet exploits +vulnerabilities in DNNs stemming from the training data. Specifically, our key +idea is to treat part of the clean training data of the target model as +""poisoned data,"" and capture the characteristics of these data that are more +sensitive to the model (typically called robust features) to construct +""triggers."" These triggers can be added to any input example to mislead the +target model, similar to backdoor attacks. We validate the effectiveness of +CleanSheet through extensive experiments on 5 datasets, 79 normally trained +models, 68 pruned models, and 39 defensive models. Results show that CleanSheet +exhibits performance comparable to state-of-the-art backdoor attacks, achieving +an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB, +respectively. 
Furthermore, CleanSheet consistently maintains a high ASR, when +confronted with various mainstream backdoor defenses.",cs.CR,['cs.CR'] +Multi-View Attentive Contextualization for Multi-View 3D Object Detection,Xianpeng Liu · Ce Zheng · Ming Qian · Nan Xue · Chen Chen · Zhebin Zhang · Chen Li · Tianfu Wu,https://xianpeng919.github.io/mvacon/,https://arxiv.org/abs/2405.12200,,2405.12200.pdf,Multi-View Attentive Contextualization for Multi-View 3D Object Detection,"We present Multi-View Attentive Contextualization (MvACon), a simple yet +effective method for improving 2D-to-3D feature lifting in query-based +multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in +the field of query-based MV3D object detection, prior art often suffers from +either the lack of exploiting high-resolution 2D features in dense +attention-based lifting, due to high computational costs, or from +insufficiently dense grounding of 3D queries to multi-scale 2D features in +sparse attention-based lifting. Our proposed MvACon hits the two birds with one +stone using a representationally dense yet computationally sparse attentive +feature contextualization scheme that is agnostic to specific 2D-to-3D feature +lifting approaches. In experiments, the proposed MvACon is thoroughly tested on +the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable +attention (DFA3D) variant, as well as the PETR, showing consistent detection +performance improvement, especially in enhancing performance in location, +orientation, and velocity prediction. It is also tested on the Waymo-mini +benchmark using BEVFormer with similar improvement. We qualitatively and +quantitatively show that global cluster-based contexts effectively encode dense +scene-level contexts for MV3D object detection. The promising results of our +proposed MvACon reinforces the adage in computer vision -- ``(contextualized) +feature matters"".",cs.CV,['cs.CV'] +RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection,Ximiao Zhang · Min Xu · Xiuzhuang Zhou,https://github.com/cnulab/RealNet,https://arxiv.org/abs/2403.05897,,2403.05897.pdf,RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection,"Self-supervised feature reconstruction methods have shown promising advances +in industrial image anomaly detection and localization. Despite this progress, +these methods still face challenges in synthesizing realistic and diverse +anomaly samples, as well as addressing the feature redundancy and pre-training +bias of pre-trained feature. In this work, we introduce RealNet, a feature +reconstruction network with realistic synthetic anomaly and adaptive feature +selection. It is incorporated with three key innovations: First, we propose +Strength-controllable Diffusion Anomaly Synthesis (SDAS), a diffusion +process-based synthesis strategy capable of generating samples with varying +anomaly strengths that mimic the distribution of real anomalous samples. +Second, we develop Anomaly-aware Features Selection (AFS), a method for +selecting representative and discriminative pre-trained feature subsets to +improve anomaly detection performance while controlling computational costs. +Third, we introduce Reconstruction Residuals Selection (RRS), a strategy that +adaptively selects discriminative residuals for comprehensive identification of +anomalous regions across multiple levels of granularity. 
We assess RealNet on +four benchmark datasets, and our results demonstrate significant improvements +in both Image AUROC and Pixel AUROC compared to the current state-of-the-art +methods. The code, data, and models are available at +https://github.com/cnulab/RealNet.",cs.CV,['cs.CV'] +CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation,Jun Wang · Yuzhe Qin · Kaiming Kuang · Yigit Korkmaz · Akhilan Gurumoorthy · Hao Su · Xiaolong Wang, ,https://arxiv.org/abs/2402.14795,,2402.14795.pdf,CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation,"We introduce CyberDemo, a novel approach to robotic imitation learning that +leverages simulated human demonstrations for real-world tasks. By incorporating +extensive data augmentation in a simulated environment, CyberDemo outperforms +traditional in-domain real-world demonstrations when transferred to the real +world, handling diverse physical and visual conditions. Regardless of its +affordability and convenience in data collection, CyberDemo outperforms +baseline methods in terms of success rates across various tasks and exhibits +generalizability with previously unseen objects. For example, it can rotate +novel tetra-valve and penta-valve, despite human demonstrations only involving +tri-valves. Our research demonstrates the significant potential of simulated +human demonstrations for real-world dexterous manipulation tasks. More details +can be found at https://cyber-demo.github.io",cs.RO,"['cs.RO', 'cs.CV']" +InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion,Jihyun Lee · Shunsuke Saito · Giljoo Nam · Minhyuk Sung · Tae-Kyun Kim,https://jyunlee.github.io/projects/interhandgen/,https://arxiv.org/abs/2403.17422,,2403.17422.pdf,InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion,"We present InterHandGen, a novel framework that learns the generative prior +of two-hand interaction. Sampling from our model yields plausible and diverse +two-hand shapes in close interaction with or without an object. Our prior can +be incorporated into any optimization or learning methods to reduce ambiguity +in an ill-posed setup. Our key observation is that directly modeling the joint +distribution of multiple instances imposes high learning complexity due to its +combinatorial nature. Thus, we propose to decompose the modeling of joint +distribution into the modeling of factored unconditional and conditional single +instance distribution. In particular, we introduce a diffusion model that +learns the single-hand distribution unconditional and conditional to another +hand via conditioning dropout. For sampling, we combine anti-penetration and +classifier-free guidance to enable plausible generation. Furthermore, we +establish the rigorous evaluation protocol of two-hand synthesis, where our +method significantly outperforms baseline generative models in terms of +plausibility and diversity. 
We also demonstrate that our diffusion prior can +boost the performance of two-hand reconstruction from monocular in-the-wild +images, achieving new state-of-the-art accuracy.",cs.CV,['cs.CV'] +DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model,Lirui Zhao · Yue Yang · Kaipeng Zhang · Wenqi Shao · Yuxin Zhang · Yu Qiao · Ping Luo · Rongrong Ji, ,https://arxiv.org/abs/2404.01342,,2404.01342.pdf,DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model,"Text-to-image (T2I) generative models have attracted significant attention +and found extensive applications within and beyond academic research. For +example, the Civitai community, a platform for T2I innovation, currently hosts +an impressive array of 74,492 distinct models. However, this diversity presents +a formidable challenge in selecting the most appropriate model and parameters, +a process that typically requires numerous trials. Drawing inspiration from the +tool usage research of large language models (LLMs), we introduce DiffAgent, an +LLM agent designed to screen the accurate selection in seconds via API calls. +DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to +accurately align T2I API responses with user input in accordance with human +preferences. To train and evaluate DiffAgent's capabilities, we present +DABench, a comprehensive dataset encompassing an extensive range of T2I APIs +from the community. Our evaluations reveal that DiffAgent not only excels in +identifying the appropriate T2I API but also underscores the effectiveness of +the SFTA training framework. Codes are available at +https://github.com/OpenGVLab/DiffAgent.",cs.CL,"['cs.CL', 'cs.AI']" +Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation,Mukul Khanna · Yongsen Mao · Hanxiao Jiang · Sanjay Haresh · Brennan Shacklett · Dhruv Batra · Alexander William Clegg · Eric Undersander · Angel Xuan Chang · Manolis Savva, ,https://arxiv.org/abs/2306.11290,,2306.11290.pdf,Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation,"We contribute the Habitat Synthetic Scene Dataset, a dataset of 211 +high-quality 3D scenes, and use it to test navigation agent generalization to +realistic 3D environments. Our dataset represents real interiors and contains a +diverse set of 18,656 models of real-world objects. We investigate the impact +of synthetic 3D scene dataset scale and realism on the task of training +embodied agents to find and navigate to objects (ObjectGoal navigation). By +comparing to synthetic 3D scene datasets from prior work, we find that scale +helps in generalization, but the benefits quickly saturate, making visual +fidelity and correlation to real-world scenes more important. Our experiments +show that agents trained on our smaller-scale dataset can match or outperform +agents trained on much larger datasets. 
Surprisingly, we observe that agents +trained on just 122 scenes from our dataset outperform agents trained on 10,000 +scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in +real-world scanned environments.",cs.CV,['cs.CV'] +Objects as volumes: A stochastic geometry view of opaque solids,Bailey Miller · Hanyu Chen · Alice Lai · Ioannis Gkioulekas,https://imaging.cs.cmu.edu/volumetric_opaque_solids/,https://arxiv.org/abs/2312.15406,,2312.15406.pdf,Objects as volumes: A stochastic geometry view of opaque solids,"We develop a theory for the representation of opaque solids as volumes. +Starting from a stochastic representation of opaque solids as random indicator +functions, we prove the conditions under which such solids can be modeled using +exponential volumetric transport. We also derive expressions for the volumetric +attenuation coefficient as a functional of the probability distributions of the +underlying indicator functions. We generalize our theory to account for +isotropic and anisotropic scattering at different parts of the solid, and for +representations of opaque solids as stochastic implicit surfaces. We derive our +volumetric representation from first principles, which ensures that it +satisfies physical constraints such as reciprocity and reversibility. We use +our theory to explain, compare, and correct previous volumetric +representations, as well as propose meaningful extensions that lead to improved +performance in 3D reconstruction tasks.",cs.CV,"['cs.CV', 'cs.GR']" +MeaCap: Memory-Augmented Zero-shot Image Captioning,Zequn Zeng · Yan Xie · Hao Zhang · Chiyu Chen · Zhengjue Wang · Bo Chen,https://github.com/joeyz0z/MeaCap,https://arxiv.org/abs/2403.03715,,2403.03715.pdf,MeaCap: Memory-Augmented Zero-shot Image Captioning,"Zero-shot image captioning (IC) without well-paired image-text data can be +divided into two categories, training-free and text-only-training. Generally, +these two types of methods realize zero-shot IC by integrating pretrained +vision-language models like CLIP for image-text similarity evaluation and a +pre-trained language model (LM) for caption generation. The main difference +between them is whether using a textual corpus to train the LM. Though +achieving attractive performance w.r.t. some metrics, existing methods often +exhibit some common drawbacks. Training-free methods tend to produce +hallucinations, while text-only-training often lose generalization capability. +To move forward, in this paper, we propose a novel Memory-Augmented zero-shot +image Captioning framework (MeaCap). Specifically, equipped with a textual +memory, we introduce a retrieve-then-filter module to get key concepts that are +highly related to the image. By deploying our proposed memory-augmented +visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate +concept-centered captions that keep high consistency with the image with fewer +hallucinations and more world-knowledge. The framework of MeaCap achieves the +state-of-the-art performance on a series of zero-shot IC settings. 
Our code is +available at https://github.com/joeyz0z/MeaCap.",cs.CV,['cs.CV'] +Weakly Supervised Monocular 3D Detection with a Single-View Image,Xueying Jiang · Sheng Jin · Lewei Lu · Xiaoqin Zhang · Shijian Lu, ,https://arxiv.org/abs/2402.19144,,2402.19144.pdf,Weakly Supervised Monocular 3D Detection with a Single-View Image,"Monocular 3D detection (M3D) aims for precise 3D object localization from a +single-view image which usually involves labor-intensive annotation of 3D +detection boxes. Weakly supervised M3D has recently been studied to obviate the +3D annotation process by leveraging many existing 2D annotations, but it often +requires extra training data such as LiDAR point clouds or multi-view images +which greatly degrades its applicability and usability in various applications. +We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that +exploits depth information to achieve M3D with a single-view image exclusively +without any 3D annotations or other training data. One key design in SKD-WM3D +is a self-knowledge distillation framework, which transforms image features +into 3D-like representations by fusing depth information and effectively +mitigates the inherent depth ambiguity in monocular scenarios with little +computational overhead in inference. In addition, we design an +uncertainty-aware distillation loss and a gradient-targeted transfer modulation +strategy which facilitate knowledge acquisition and knowledge transfer, +respectively. Extensive experiments show that SKD-WM3D surpasses the +state-of-the-art clearly and is even on par with many fully supervised methods.",cs.CV,['cs.CV'] +SemCity: Semantic Scene Generation with Triplane Diffusion,Jumin Lee · Sebin Lee · Changho Jo · Woobin Im · Ju-hyeong Seon · Sung-Eui Yoon, ,https://arxiv.org/abs/2403.07773,,2403.07773.pdf,SemCity: Semantic Scene Generation with Triplane Diffusion,"We present ""SemCity,"" a 3D diffusion model for semantic scene generation in +real-world outdoor environments. Most 3D diffusion models focus on generating a +single object, synthetic indoor scenes, or synthetic outdoor scenes, while the +generation of real-world outdoor scenes is rarely addressed. In this paper, we +concentrate on generating a real-outdoor scene through learning a diffusion +model on a real-world outdoor dataset. In contrast to synthetic data, +real-outdoor datasets often contain more empty spaces due to sensor +limitations, causing challenges in learning real-outdoor distributions. To +address this issue, we exploit a triplane representation as a proxy form of +scene distributions to be learned by our diffusion model. Furthermore, we +propose a triplane manipulation that integrates seamlessly with our triplane +diffusion model. The manipulation improves our diffusion model's applicability +in a variety of downstream tasks related to outdoor scene generation such as +scene inpainting, scene outpainting, and semantic scene completion refinements. +In experimental results, we demonstrate that our triplane diffusion model shows +meaningful generation results compared with existing work in a real-outdoor +dataset, SemanticKITTI. We also show our triplane manipulation facilitates +seamlessly adding, removing, or modifying objects within a scene. Further, it +also enables the expansion of scenes toward a city-level scale. Finally, we +evaluate our method on semantic scene completion refinements where our +diffusion model enhances predictions of semantic scene completion networks by +learning scene distribution. 
Our code is available at +https://github.com/zoomin-lee/SemCity.",cs.CV,['cs.CV'] +SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras,Yuan Gao · Yuqing Zhu · Xinjun Li · Yimin Du · Tianzhu Zhang, ,https://arxiv.org/abs/2401.01042,,2401.01042.pdf,Relating Events and Frames Based on Self-Supervised Learning and Uncorrelated Conditioning for Unsupervised Domain Adaptation,"Event-based cameras provide accurate and high temporal resolution +measurements for performing computer vision tasks in challenging scenarios, +such as high-dynamic range environments and fast-motion maneuvers. Despite +their advantages, utilizing deep learning for event-based vision encounters a +significant obstacle due to the scarcity of annotated data caused by the +relatively recent emergence of event-based cameras. To overcome this +limitation, leveraging the knowledge available from annotated data obtained +with conventional frame-based cameras presents an effective solution based on +unsupervised domain adaptation. We propose a new algorithm tailored for +adapting a deep neural network trained on annotated frame-based data to +generalize well on event-based unannotated data. Our approach incorporates +uncorrelated conditioning and self-supervised learning in an adversarial +learning scheme to close the gap between the two source and target domains. By +applying self-supervised learning, the algorithm learns to align the +representations of event-based data with those from frame-based camera data, +thereby facilitating knowledge transfer.Furthermore, the inclusion of +uncorrelated conditioning ensures that the adapted model effectively +distinguishes between event-based and conventional data, enhancing its ability +to classify event-based images accurately.Through empirical experimentation and +evaluation, we demonstrate that our algorithm surpasses existing approaches +designed for the same purpose using two benchmarks. The superior performance of +our solution is attributed to its ability to effectively utilize annotated data +from frame-based cameras and transfer the acquired knowledge to the event-based +vision domain.",cs.CV,['cs.CV'] +Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation,Ming Xu · Stephen Gould, ,https://arxiv.org/abs/2404.01518,,2404.01518.pdf,Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation,"We propose a novel approach to the action segmentation task for long, +untrimmed videos, based on solving an optimal transport problem. By encoding a +temporal consistency prior into a Gromov-Wasserstein problem, we are able to +decode a temporally consistent segmentation from a noisy affinity/matching cost +matrix between video frames and action classes. Unlike previous approaches, our +method does not require knowing the action order for a video to attain temporal +consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can +be efficiently solved on GPUs using a few iterations of projected mirror +descent. We demonstrate the effectiveness of our method in an unsupervised +learning setting, where our method is used to generate pseudo-labels for +self-training. 
We evaluate our segmentation approach and unsupervised learning +pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly +datasets, yielding state-of-the-art results for the unsupervised video action +segmentation task.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes,YuJie Lu · Long Wan · Nayu Ding · Yulong Wang · Shuhan Shen · Shen Cai · Lin Gao,http://www.cscvlab.com/research/UODFs/index.html,https://arxiv.org/abs/2403.01414,,2403.01414.pdf,Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes,"Neural implicit representation of geometric shapes has witnessed considerable +advancements in recent years. However, common distance field based implicit +representations, specifically signed distance field (SDF) for watertight shapes +or unsigned distance field (UDF) for arbitrary shapes, routinely suffer from +degradation of reconstruction accuracy when converting to explicit surface +points and meshes. In this paper, we introduce a novel neural implicit +representation based on unsigned orthogonal distance fields (UODFs). In UODFs, +the minimal unsigned distance from any spatial point to the shape surface is +defined solely in one orthogonal direction, contrasting with the +multi-directional determination made by SDF and UDF. Consequently, every point +in the 3D UODFs can directly access its closest surface points along three +orthogonal directions. This distinctive feature leverages the accurate +reconstruction of surface points without interpolation errors. We verify the +effectiveness of UODFs through a range of reconstruction examples, extending +from simple watertight or non-watertight shapes to complex shapes that include +hollows, internal or assembling structures.",cs.CV,['cs.CV'] +Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transfomers,Sheng Yang · Jiawang Bai · Kuofeng Gao · Yong Yang · Yiming Li · Shu-Tao Xia,https://github.com/20000yshust/SWARM,https://arxiv.org/abs/2405.10612,,2405.10612.pdf,Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers,"Given the power of vision transformers, a new learning paradigm, pre-training +and then prompting, makes it more efficient and effective to address downstream +visual recognition tasks. In this paper, we identify a novel security threat +towards such a paradigm from the perspective of backdoor attacks. Specifically, +an extra prompt token, called the switch token in this work, can turn the +backdoor mode on, i.e., converting a benign model into a backdoored one. Once +under the backdoor mode, a specific trigger can force the model to predict a +target class. It poses a severe risk to the users of cloud API, since the +malicious behavior can not be activated and detected under the benign mode, +thus making the attack very stealthy. To attack a pre-trained model, our +proposed attack, named SWARM, learns a trigger and prompt tokens including a +switch token. They are optimized with the clean loss which encourages the model +always behaves normally even the trigger presents, and the backdoor loss that +ensures the backdoor can be activated by the trigger when the switch is on. +Besides, we utilize the cross-mode feature distillation to reduce the effect of +the switch token on clean samples. 
The experiments on diverse visual +recognition tasks confirm the success of our switchable backdoor attack, i.e., +achieving 95%+ attack success rate, and also being hard to be detected and +removed. Our code is available at https://github.com/20000yshust/SWARM.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Unmixing before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis,Yang Yu · Erting Pan · Xinya Wang · Yuheng Wu · Xiaoguang Mei · Jiayi Ma,https://hsi-synthesis.github.io/,,https://ieeexplore.ieee.org/document/10414148,,,,,nan +Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling,Liwen Wu · Sai Bi · Zexiang Xu · Fujun Luan · Kai Zhang · Iliyan Georgiev · Kalyan Sunkavalli · Ravi Ramamoorthi,https://lwwu2.github.io/nde/,https://arxiv.org/abs/2405.14847,,2405.14847.pdf,Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling,"Novel-view synthesis of specular objects like shiny metals or glossy paints +remains a significant challenge. Not only the glossy appearance but also global +illumination effects, including reflections of other objects in the +environment, are critical components to faithfully reproduce a scene. In this +paper, we present Neural Directional Encoding (NDE), a view-dependent +appearance encoding of neural radiance fields (NeRF) for rendering specular +objects. NDE transfers the concept of feature-grid-based spatial encoding to +the angular domain, significantly improving the ability to model high-frequency +angular signals. In contrast to previous methods that use encoding functions +with only angular input, we additionally cone-trace spatial features to obtain +a spatially varying directional encoding, which addresses the challenging +interreflection effects. Extensive experiments on both synthetic and real +datasets show that a NeRF model with NDE (1) outperforms the state of the art +on view synthesis of specular objects, and (2) works with small networks to +allow fast (real-time) inference. The project webpage and source code are +available at: \url{https://lwwu2.github.io/nde/}.",cs.CV,['cs.CV'] +ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models,Jeong-gi Kwak · Erqun Dong · Yuhe Jin · Hanseok Ko · Shweta Mahajan · Kwang Moo Yi,https://ubc-vision.github.io/vivid123/,https://arxiv.org/abs/2312.01305,,2312.01305.pdf,ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models,"Generating novel views of an object from a single image is a challenging +task. It requires an understanding of the underlying 3D structure of the object +from an image and rendering high-quality, spatially consistent new views. While +recent methods for view synthesis based on diffusion have shown great progress, +achieving consistency among various view estimates and at the same time abiding +by the desired camera pose remains a critical problem yet to be solved. In this +work, we demonstrate a strikingly simple method, where we utilize a pre-trained +video diffusion model to solve this problem. Our key idea is that synthesizing +a novel view could be reformulated as synthesizing a video of a camera going +around the object of interest -- a scanning video -- which then allows us to +leverage the powerful priors that a video diffusion model would have learned. +Thus, to perform novel-view synthesis, we create a smooth camera trajectory to +the target view that we wish to render, and denoise using both a +view-conditioned diffusion model and a video diffusion model. 
By doing so, we +obtain a highly consistent novel view synthesis, outperforming the state of the +art.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution,Qingping Zheng · Ling Zheng · Yuanfan Guo · Ying Li · Songcen Xu · Jiankang Deng · Hang Xu, ,https://arxiv.org/abs/2403.16643v1,,2403.16643v1.pdf,Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution,"Artifact-free super-resolution (SR) aims to translate low-resolution images +into their high-resolution counterparts with a strict integrity of the original +content, eliminating any distortions or synthetic details. While traditional +diffusion-based SR techniques have demonstrated remarkable abilities to enhance +image detail, they are prone to artifact introduction during iterative +procedures. Such artifacts, ranging from trivial noise to unauthentic textures, +deviate from the true structure of the source image, thus challenging the +integrity of the super-resolution process. In this work, we propose +Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that +delves into the latent space to effectively identify and mitigate the +propagation of artifacts. Our SARGD begins by using an artifact detector to +identify implausible pixels, creating a binary mask that highlights artifacts. +Following this, the Reality Guidance Refinement (RGR) process refines artifacts +by integrating this mask with realistic latent representations, improving +alignment with the original image. Nonetheless, initial realistic-latent +representations from lower-quality images result in over-smoothing in the final +output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. +It dynamically computes a reality score, enhancing the sharpness of the +realistic latent. These alternating mechanisms collectively achieve +artifact-free super-resolution. Extensive experiments demonstrate the +superiority of our method, delivering detailed artifact-free high-resolution +images while reducing sampling steps by 2X. We release our code at +https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git.",eess.IV,"['eess.IV', 'cs.CV']" +SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes,Yihua Huang · Yangtian Sun · Ziyi Yang · Xiaoyang Lyu · Yan-Pei Cao · Xiaojuan Qi,https://yihua7.github.io/SC-GS-web/,https://arxiv.org/abs/2312.14937,,2312.14937.pdf,SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes,"Novel view synthesis for dynamic scenes is still a challenging problem in +computer vision and graphics. Recently, Gaussian splatting has emerged as a +robust technique to represent static scenes and enable high-quality and +real-time novel view synthesis. Building upon this technique, we propose a new +representation that explicitly decomposes the motion and appearance of dynamic +scenes into sparse control points and dense Gaussians, respectively. Our key +idea is to use sparse control points, significantly fewer in number than the +Gaussians, to learn compact 6 DoF transformation bases, which can be locally +interpolated through learned interpolation weights to yield the motion field of +3D Gaussians. We employ a deformation MLP to predict time-varying 6 DoF +transformations for each control point, which reduces learning complexities, +enhances learning abilities, and facilitates obtaining temporal and spatial +coherent motion patterns. 
Then, we jointly learn the 3D Gaussians, the +canonical space locations of control points, and the deformation MLP to +reconstruct the appearance, geometry, and dynamics of 3D scenes. During +learning, the location and number of control points are adaptively adjusted to +accommodate varying motion complexities in different regions, and an ARAP loss +following the principle of as rigid as possible is developed to enforce spatial +continuity and local rigidity of learned motions. Finally, thanks to the +explicit sparse motion representation and its decomposition from appearance, +our method can enable user-controlled motion editing while retaining +high-fidelity appearances. Extensive experiments demonstrate that our approach +outperforms existing approaches on novel view synthesis with a high rendering +speed and enables novel appearance-preserved motion editing applications. +Project page: https://yihua7.github.io/SC-GS-web/",cs.CV,"['cs.CV', 'cs.GR']" +PHYSCENE: Physically Interactable 3D Scene Synthesis for Embodied AI,Yandan Yang · Baoxiong Jia · Peiyuan Zhi · Siyuan Huang, ,https://arxiv.org/abs/2404.09465,,2404.09465.pdf,PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI,"With recent developments in Embodied Artificial Intelligence (EAI) research, +there has been a growing demand for high-quality, large-scale interactive scene +generation. While prior methods in scene synthesis have prioritized the +naturalness and realism of the generated scenes, the physical plausibility and +interactivity of scenes have been largely left unexplored. To address this +disparity, we introduce PhyScene, a novel method dedicated to generating +interactive 3D scenes characterized by realistic layouts, articulated objects, +and rich physical interactivity tailored for embodied agents. Based on a +conditional diffusion model for capturing scene layouts, we devise novel +physics- and interactivity-based guidance mechanisms that integrate constraints +from object collision, room layout, and object reachability. Through extensive +experiments, we demonstrate that PhyScene effectively leverages these guidance +functions for physically interactable scene synthesis, outperforming existing +state-of-the-art scene synthesis methods by a large margin. Our findings +suggest that the scenes generated by PhyScene hold considerable potential for +facilitating diverse skill acquisition among agents within interactive +environments, thereby catalyzing further advancements in embodied AI research. +Project website: http://physcene.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +PlatoNeRF: 3D Reconstruction in Plato’s Cave via Single-View Two-Bounce Lidar,Tzofi Klinghoffer · Xiaoyu Xiang · Siddharth Somasundaram · Yuchen Fan · Christian Richardt · Ramesh Raskar · Rakesh Ranjan,https://platonerf.github.io/,https://arxiv.org/abs/2312.14239,,2312.14239.pdf,PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar,"3D reconstruction from a single-view is challenging because of the ambiguity +from monocular cues and lack of information about occluded regions. Neural +radiance fields (NeRF), while popular for view synthesis and 3D reconstruction, +are typically reliant on multi-view images. Existing methods for single-view 3D +reconstruction with NeRF rely on either data priors to hallucinate views of +occluded regions, which may not be physically accurate, or shadows observed by +RGB cameras, which are difficult to detect in ambient light and low albedo +backgrounds. 
We propose using time-of-flight data captured by a single-photon +avalanche diode to overcome these limitations. Our method models two-bounce +optical paths with NeRF, using lidar transient data for supervision. By +leveraging the advantages of both NeRF and two-bounce light measured by lidar, +we demonstrate that we can reconstruct visible and occluded geometry without +data priors or reliance on controlled ambient lighting or scene albedo. In +addition, we demonstrate improved generalization under practical constraints on +sensor spatial- and temporal-resolution. We believe our method is a promising +direction as single-photon lidars become ubiquitous on consumer devices, such +as phones, tablets, and headsets.",cs.CV,"['cs.CV', 'eess.IV']" +Learning to Rank Patches for Unbiased Image Redundancy Reduction,Yang Luo · Zhineng Chen · Peng Zhou · Zuxuan Wu · Xieping Gao · Yu-Gang Jiang, ,https://arxiv.org/abs/2404.00680,,2404.00680.pdf,Learning to Rank Patches for Unbiased Image Redundancy Reduction,"Images suffer from heavy spatial redundancy because pixels in neighboring +regions are spatially correlated. Existing approaches strive to overcome this +limitation by reducing less meaningful image regions. However, current leading +methods rely on supervisory signals. They may compel models to preserve content +that aligns with labeled categories and discard content belonging to unlabeled +categories. This categorical inductive bias makes these methods less effective +in real-world scenarios. To address this issue, we propose a self-supervised +framework for image redundancy reduction called Learning to Rank Patches +(LTRP). We observe that image reconstruction of masked image modeling models is +sensitive to the removal of visible patches when the masking ratio is high +(e.g., 90\%). Building upon it, we implement LTRP via two steps: inferring the +semantic density score of each patch by quantifying variation between +reconstructions with and without this patch, and learning to rank the patches +with the pseudo score. The entire process is self-supervised, thus getting out +of the dilemma of categorical inductive bias. We design extensive experiments +on different datasets and tasks. The results demonstrate that LTRP outperforms +both supervised and other self-supervised methods due to the fair assessment of +image content.",cs.CV,['cs.CV'] +Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation,Jiapeng Su · Qi Fan · Wenjie Pei · Guangming Lu · Fanglin Chen, ,https://arxiv.org/abs/2404.10322v1,,2404.10322v1.pdf,Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation,"Few-shot semantic segmentation (FSS) has achieved great success on segmenting +objects of novel classes, supported by only a few annotated samples. However, +existing FSS methods often underperform in the presence of domain shifts, +especially when encountering new domain styles that are unseen during training. +It is suboptimal to directly adapt or generalize the entire model to new +domains in the few-shot scenario. Instead, our key idea is to adapt a small +adapter for rectifying diverse target domain styles to the source domain. +Consequently, the rectified target domain features can fittingly benefit from +the well-optimized source domain segmentation model, which is intently trained +on sufficient source domain data. Training domain-rectifying adapter requires +sufficiently diverse target domains. 
We thus propose a novel local-global style +perturbation method to simulate diverse potential target domains by +perturbating the feature channel statistics of the individual images and +collective statistics of the entire source domain, respectively. Additionally, +we propose a cyclic domain alignment module to facilitate the adapter +effectively rectifying domains using a reverse domain rectification +supervision. The adapter is trained to rectify the image features from diverse +synthesized target domains to align with the source domain. During testing on +target domains, we start by rectifying the image features and then conduct +few-shot segmentation on the domain-rectified features. Extensive experiments +demonstrate the effectiveness of our method, achieving promising results on +cross-domain few-shot semantic segmentation tasks. Our code is available at +https://github.com/Matt-Su/DR-Adapter.",cs.CV,['cs.CV'] +GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,Van Nguyen Nguyen · Thibault Groueix · Mathieu Salzmann · Vincent Lepetit, ,https://arxiv.org/abs/2311.14155,,2311.14155.pdf,GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,"We present GigaPose, a fast, robust, and accurate method for CAD-based novel +object pose estimation in RGB images. GigaPose first leverages discriminative +""templates"", rendered images of the CAD models, to recover the out-of-plane +rotation and then uses patch correspondences to estimate the four remaining +parameters. Our approach samples templates in only a two-degrees-of-freedom +space instead of the usual three and matches the input image to the templates +using fast nearest-neighbor search in feature space, results in a speedup +factor of 35x compared to the state of the art. Moreover, GigaPose is +significantly more robust to segmentation errors. Our extensive evaluation on +the seven core datasets of the BOP challenge demonstrates that it achieves +state-of-the-art accuracy and can be seamlessly integrated with existing +refinement methods. Additionally, we show the potential of GigaPose with 3D +models predicted by recent work on 3D reconstruction from a single image, +relaxing the need for CAD models and making 6D pose object estimation much more +convenient. Our source code and trained models are publicly available at +https://github.com/nv-nguyen/gigaPose",cs.CV,['cs.CV'] +Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval,Fan Zhang · Xian-Sheng Hua · Chong Chen · Xiao Luo, ,,https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4774118,,,,,nan +SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction,Yang Zhou · Hao Shao · Letian Wang · Steven L. Waslander · Hongsheng Li · Yu Liu, ,https://arxiv.org/abs/2403.11492,,2403.11492.pdf,SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction,"Predicting the future motion of surrounding agents is essential for +autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed +environments. Context information, such as road maps and surrounding agents' +states, provides crucial geometric and semantic information for motion behavior +prediction. To this end, recent works explore two-stage prediction frameworks +where coarse trajectories are first proposed, and then used to select critical +context information for trajectory refinement. 
However, they either incur a +large amount of computation or bring limited improvement, if not both. In this +paper, we introduce a novel scenario-adaptive refinement strategy, named +SmartRefine, to refine prediction with minimal additional computation. +Specifically, SmartRefine can comprehensively adapt refinement configurations +based on each scenario's properties, and smartly chooses the number of +refinement iterations by introducing a quality score to measure the prediction +quality and remaining refinement potential of each scenario. SmartRefine is +designed as a generic and flexible approach that can be seamlessly integrated +into most state-of-the-art motion prediction models. Experiments on Argoverse +(1 & 2) show that our method consistently improves the prediction accuracy of +multiple state-of-the-art prediction models. Specifically, by adding +SmartRefine to QCNet, we outperform all published ensemble-free works on the +Argoverse 2 leaderboard (single agent track) at submission. Comprehensive +studies are also conducted to ablate design choices and explore the mechanism +behind multi-iteration refinement. Codes are available at +https://github.com/opendilab/SmartRefine/",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +GenTron: Diffusion Transformers for Image and Video Generation,Shoufa Chen · Mengmeng Xu · Jiawei Ren · Yuren Cong · Sen He · Yanping Xie · Animesh Sinha · Ping Luo · Tao Xiang · Juan-Manuel Pérez-Rúa, ,https://arxiv.org/abs/2312.04557,,2312.04557.pdf,GenTron: Diffusion Transformers for Image and Video Generation,"In this study, we explore Transformer-based diffusion models for image and +video generation. Despite the dominance of Transformer architectures in various +fields due to their flexibility and scalability, the visual generative domain +primarily utilizes CNN-based U-Net architectures, particularly in +diffusion-based models. We introduce GenTron, a family of Generative models +employing Transformer-based diffusion, to address this gap. Our initial step +was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a +process involving thorough empirical exploration of the conditioning mechanism. +We then scale GenTron from approximately 900M to over 3B parameters, observing +significant improvements in visual quality. Furthermore, we extend GenTron to +text-to-video generation, incorporating novel motion-free guidance to enhance +video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win +rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text +alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, +underscoring its strengths in compositional generation. We believe this work +will provide meaningful insights and serve as a valuable reference for future +research.",cs.CV,['cs.CV'] +Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch,Xidong Wu · Shangqian Gao · Zeyu Zhang · Zhenzhen Li · Runxue Bao · Yanfu Zhang · Xiaoqian Wang · Heng Huang, ,https://arxiv.org/abs/2403.14729,,2403.14729.pdf,Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch,"Current techniques for deep neural network (DNN) pruning often involve +intricate multi-step processes that require domain-specific expertise, making +their widespread adoption challenging. To address the limitation, the +Only-Train-Once (OTO) and OTOv2 are proposed to eliminate the need for +additional fine-tuning steps by directly training and compressing a general DNN +from scratch. 
Nevertheless, the static design of optimizers (in OTO) can lead +to convergence issues of local optima. In this paper, we proposed the +Auto-Train-Once (ATO), an innovative network pruning algorithm designed to +automatically reduce the computational and storage costs of DNNs. During the +model training phase, our approach not only trains the target model but also +leverages a controller network as an architecture generator to guide the +learning of target model weights. Furthermore, we developed a novel stochastic +gradient algorithm that enhances the coordination between model training and +controller network training, thereby improving pruning performance. We provide +a comprehensive convergence analysis as well as extensive experiments, and the +results show that our approach achieves state-of-the-art performance across +various model architectures (including ResNet18, ResNet34, ResNet50, ResNet56, +and MobileNetv2) on standard benchmark datasets (CIFAR-10, CIFAR-100, and +ImageNet).",cs.CV,"['cs.CV', 'cs.LG']" +NOPE: Novel Object Pose Estimation from a Single Image,Van Nguyen Nguyen · Thibault Groueix · Georgy Ponimatkin · Yinlin Hu · Renaud Marlet · Mathieu Salzmann · Vincent Lepetit, ,https://arxiv.org/abs/2311.14155,,,GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,"We present GigaPose, a fast, robust, and accurate method for CAD-based novel +object pose estimation in RGB images. GigaPose first leverages discriminative +""templates"", rendered images of the CAD models, to recover the out-of-plane +rotation and then uses patch correspondences to estimate the four remaining +parameters. Our approach samples templates in only a two-degrees-of-freedom +space instead of the usual three and matches the input image to the templates +using fast nearest-neighbor search in feature space, results in a speedup +factor of 35x compared to the state of the art. Moreover, GigaPose is +significantly more robust to segmentation errors. Our extensive evaluation on +the seven core datasets of the BOP challenge demonstrates that it achieves +state-of-the-art accuracy and can be seamlessly integrated with existing +refinement methods. Additionally, we show the potential of GigaPose with 3D +models predicted by recent work on 3D reconstruction from a single image, +relaxing the need for CAD models and making 6D pose object estimation much more +convenient. Our source code and trained models are publicly available at +https://github.com/nv-nguyen/gigaPose",cs.CV,['cs.CV'] +Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction,Hao Li · Ying Chen · Yifei Chen · Rongshan Yu · Wenxian Yang · Liansheng Wang · Bowen Ding · Yuchen Han, ,https://arxiv.org/abs/2402.19326,,2402.19326.pdf,Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction,"Whole Slide Image (WSI) classification is often formulated as a Multiple +Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have +demonstrated remarkable performance in WSI classification. However, existing +methods leverage coarse-grained pathogenetic descriptions for visual +representation supervision, which are insufficient to capture the complex +visual appearance of pathogenetic images, hindering the generalizability of +models on diverse downstream tasks. Additionally, processing high-resolution +WSIs can be computationally expensive. 
In this paper, we propose a novel +""Fine-grained Visual-Semantic Interaction"" (FiVE) framework for WSI +classification. It is designed to enhance the model's generalizability by +leveraging the interaction between localized visual patterns and fine-grained +pathological semantics. Specifically, with meticulously designed queries, we +start by utilizing a large language model to extract fine-grained pathological +descriptions from various non-standardized raw reports. The output descriptions +are then reconstructed into fine-grained labels used for training. By +introducing a Task-specific Fine-grained Semantics (TFS) module, we enable +prompts to capture crucial visual information in WSIs, which enhances +representation learning and augments generalization capabilities significantly. +Furthermore, given that pathological visual patterns are redundantly +distributed across tissue slices, we sample a subset of visual instances during +training. Our method demonstrates robust generalizability and strong +transferability, dominantly outperforming the counterparts on the TCGA Lung +Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The +code is available at: https://github.com/ls1rius/WSI_FiVE.",cs.CV,['cs.CV'] +In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification,Jinseong Park · Yujin Choi · Jaewook Lee,https://jinseongp.github.io/2024/05/28/cvpr2024.html,,https://jinseongp.github.io/2024/05/28/cvpr2024.html,,,,,nan +CapHuman: Capture Your Moments in Parallel Universes,Chao Liang · Fan Ma · Linchao Zhu · Yingying Deng · Yi Yang,https://caphuman.github.io/,https://arxiv.org/abs/2402.00627,,2402.00627.pdf,CapHuman: Capture Your Moments in Parallel Universes,"We concentrate on a novel human-centric image synthesis task, that is, given +only one reference facial photograph, it is expected to generate specific +individual images with diverse head positions, poses, facial expressions, and +illuminations in different contexts. To accomplish this goal, we argue that our +generative model should be capable of the following favorable characteristics: +(1) a strong visual and semantic understanding of our world and human society +for basic object and human image generation. (2) generalizable identity +preservation ability. (3) flexible and fine-grained head control. Recently, +large pre-trained text-to-image diffusion models have shown remarkable results, +serving as a powerful generative foundation. As a basis, we aim to unleash the +above two capabilities of the pre-trained model. In this work, we present a new +framework named CapHuman. We embrace the ""encode then learn to align"" paradigm, +which enables generalizable identity preservation for new individuals without +cumbersome tuning at inference. CapHuman encodes identity features and then +learns to align them into the latent space. Moreover, we introduce the 3D +facial prior to equip our model with control over the human head in a flexible +and 3D-consistent manner. Extensive qualitative and quantitative analyses +demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, +and high-fidelity portraits with content-rich representations and various head +renditions, superior to established baselines. 
Code and checkpoint will be +released at https://github.com/VamosC/CapHuman.",cs.CV,"['cs.CV', 'cs.AI']" +CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers,Shahaf Arica · Or Rubin · Sapir Gershov · Shlomi Laufer,https://github.com/shahaf-arica/cuvler,https://arxiv.org/abs/2403.07700,,2403.07700.pdf,CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers,"In this paper, we introduce VoteCut, an innovative method for unsupervised +object discovery that leverages feature representations from multiple +self-supervised models. VoteCut employs normalized-cut based graph +partitioning, clustering and a pixel voting approach. Additionally, We present +CuVLER (Cut-Vote-and-LEaRn), a zero-shot model, trained using pseudo-labels, +generated by VoteCut, and a novel soft target loss to refine segmentation +accuracy. Through rigorous evaluations across multiple datasets and several +unsupervised setups, our methods demonstrate significant improvements in +comparison to previous state-of-the-art models. Our ablation studies further +highlight the contributions of each component, revealing the robustness and +efficacy of our approach. Collectively, VoteCut and CuVLER pave the way for +future advancements in image segmentation.",cs.CV,['cs.CV'] +LEDITS++: Limitless Image Editing using Text-to-Image Models,Manuel Brack · Felix Friedrich · Katharina Kornmeier · Linoy Tsaban · Patrick Schramowski · Kristian Kersting · Apolinário Passos, ,https://arxiv.org/abs/2311.16711,,2311.16711.pdf,LEDITS++: Limitless Image Editing using Text-to-Image Models,"Text-to-image diffusion models have recently received increasing interest for +their astonishing ability to produce high-fidelity images from solely text +inputs. Subsequent research efforts aim to exploit and apply their capabilities +to real image editing. However, existing image-to-image methods are often +inefficient, imprecise, and of limited versatility. They either require +time-consuming fine-tuning, deviate unnecessarily strongly from the input +image, and/or lack support for multiple, simultaneous edits. To address these +issues, we introduce LEDITS++, an efficient yet versatile and precise textual +image manipulation technique. LEDITS++'s novel inversion approach requires no +tuning nor optimization and produces high-fidelity results with a few diffusion +steps. Second, our methodology supports multiple simultaneous edits and is +architecture-agnostic. Third, we use a novel implicit masking technique that +limits changes to relevant image regions. We propose the novel TEdBench++ +benchmark as part of our exhaustive evaluation. Our results demonstrate the +capabilities of LEDITS++ and its improvements over previous methods. The +project page is available at https://leditsplusplus-project.static.hf.space .",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG']" +Are Conventional SNNs Really Efficient? A Perspective from Network Quantization,Guobin Shen · Dongcheng Zhao · Tenglong Li · Jindong Li · Yi Zeng, ,https://arxiv.org/abs/2311.10802,,2311.10802.pdf,Is Conventional SNN Really Efficient? A Perspective from Network Quantization,"Spiking Neural Networks (SNNs) have been widely praised for their high energy +efficiency and immense potential. However, comprehensive research that +critically contrasts and correlates SNNs with quantized Artificial Neural +Networks (ANNs) remains scant, often leading to skewed comparisons lacking +fairness towards ANNs. 
This paper introduces a unified perspective, +illustrating that the time steps in SNNs and quantized bit-widths of activation +values present analogous representations. Building on this, we present a more +pragmatic and rational approach to estimating the energy consumption of SNNs. +Diverging from the conventional Synaptic Operations (SynOps), we champion the +""Bit Budget"" concept. This notion permits an intricate discourse on +strategically allocating computational and storage resources between weights, +activation values, and temporal steps under stringent hardware constraints. +Guided by the Bit Budget paradigm, we discern that pivoting efforts towards +spike patterns and weight quantization, rather than temporal attributes, +elicits profound implications for model performance. Utilizing the Bit Budget +for holistic design consideration of SNNs elevates model performance across +diverse data types, encompassing static imagery and neuromorphic datasets. Our +revelations bridge the theoretical chasm between SNNs and quantized ANNs and +illuminate a pragmatic trajectory for future endeavors in energy-efficient +neural computations.",cs.NE,['cs.NE'] +Task-conditioned adaptation of visual features in multi-task policy learning,Pierre Marza · Laetitia Matignon · Olivier Simonin · Christian Wolf,https://pierremarza.github.io/projects/task_conditioned_adaptation/,https://arxiv.org/abs/2402.07739v1,,2402.07739v1.pdf,Task-conditioned adaptation of visual features in multi-task policy learning,"Successfully addressing a wide variety of tasks is a core ability of +autonomous agents, which requires flexibly adapting the underlying +decision-making strategies and, as we argue in this work, also adapting the +underlying perception modules. An analogical argument would be the human visual +system, which uses top-down signals to focus attention determined by the +current task. Similarly, in this work, we adapt pre-trained large vision models +conditioned on specific downstream tasks in the context of multi-task policy +learning. We introduce task-conditioned adapters that do not require finetuning +any pre-trained weights, combined with a single policy trained with behavior +cloning and capable of addressing multiple tasks. We condition the policy and +visual adapters on task embeddings, which can be selected at inference if the +task is known, or alternatively inferred from a set of example demonstrations. +To this end, we propose a new optimization-based estimator. We evaluate the +method on a wide variety of tasks of the CortexBench benchmark and show that, +compared to existing work, it can be addressed with a single policy. In +particular, we demonstrate that adapting visual features is a key design choice +and that the method generalizes to unseen tasks given visual demonstrations.",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Open-Vocabulary Video Anomaly Detection,Peng Wu · Xuerong Zhou · Guansong Pang · Yujia Sun · Jing Liu · Peng Wang · Yanning Zhang, ,https://arxiv.org/abs/2311.07042,,2311.07042.pdf,Open-Vocabulary Video Anomaly Detection,"Video anomaly detection (VAD) with weak supervision has achieved remarkable +performance in utilizing video-level labels to discriminate whether a video +frame is normal or abnormal. However, current approaches are inherently limited +to a closed-set setting and may struggle in open-world applications where there +can be anomaly categories in the test data unseen during training. 
A few recent +studies attempt to tackle a more realistic setting, open-set VAD, which aims to +detect unseen anomalies given seen anomalies and normal videos. However, such a +setting focuses on predicting frame anomaly scores, having no ability to +recognize the specific categories of anomalies, despite the fact that this +ability is essential for building more informed video surveillance systems. +This paper takes a step further and explores open-vocabulary video anomaly +detection (OVVAD), in which we aim to leverage pre-trained large models to +detect and categorize seen and unseen anomalies. To this end, we propose a +model that decouples OVVAD into two mutually complementary tasks -- +class-agnostic detection and class-specific classification -- and jointly +optimizes both tasks. Particularly, we devise a semantic knowledge injection +module to introduce semantic knowledge from large language models for the +detection task, and design a novel anomaly synthesis module to generate pseudo +unseen anomaly videos with the help of large vision generation models for the +classification task. These semantic knowledge and synthesis anomalies +substantially extend our model's capability in detecting and categorizing a +variety of seen and unseen anomalies. Extensive experiments on three +widely-used benchmarks demonstrate our model achieves state-of-the-art +performance on OVVAD task.",cs.CV,['cs.CV'] +Hierarchical Histogram Threshold Segmentation – Auto-terminating High-detail Oversegmentation,Thomas Chang · Simon Seibt · Bartosz von Rymon Lipinski,https://changtvs.github.io/hierarchical-histogram-threshold-segmentation/,,https://www.nature.com/articles/s41598-023-36066-8,,,,,nan +ManiFPT: Defining and Analyzing Fingerprints of Generative Models,Hae Jin Song · Mahyar Khayatkhoei · Wael AbdAlmageed, ,https://arxiv.org/abs/2402.10401,,2402.10401.pdf,ManiFPT: Defining and Analyzing Fingerprints of Generative Models,"Recent works have shown that generative models leave traces of their +underlying generative process on the generated samples, broadly referred to as +fingerprints of a generative model, and have studied their utility in detecting +synthetic images from real ones. However, the extend to which these +fingerprints can distinguish between various types of synthetic image and help +identify the underlying generative process remain under-explored. In +particular, the very definition of a fingerprint remains unclear, to our +knowledge. To that end, in this work, we formalize the definition of artifact +and fingerprint in generative models, propose an algorithm for computing them +in practice, and finally study its effectiveness in distinguishing a large +array of different generative models. We find that using our proposed +definition can significantly improve the performance on the task of identifying +the underlying generative process from samples (model attribution) compared to +existing methods. 
Additionally, we study the structure of the fingerprints, and +observe that it is very predictive of the effect of different design choices on +the generative process.",cs.LG,"['cs.LG', 'cs.CV']" +Beyond Text: Frozen Large Language Models in Visual Signal Comprehension,Lei Zhu · Fangyun Wei · Yanye Lu, ,https://arxiv.org/abs/2403.07874,,2403.07874.pdf,Beyond Text: Frozen Large Language Models in Visual Signal Comprehension,"In this work, we investigate the potential of a large language model (LLM) to +directly comprehend visual signals without the necessity of fine-tuning on +multi-modal datasets. The foundational concept of our method views an image as +a linguistic entity, and translates it to a set of discrete words derived from +the LLM's vocabulary. To achieve this, we present the Vision-to-Language +Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a +``foreign language'' with the combined aid of an encoder-decoder, the LLM +vocabulary, and a CLIP model. With this innovative image encoding, the LLM +gains the ability not only for visual comprehension but also for image +denoising and restoration in an auto-regressive fashion-crucially, without any +fine-tuning. We undertake rigorous experiments to validate our method, +encompassing understanding tasks like image recognition, image captioning, and +visual question answering, as well as image denoising tasks like inpainting, +outpainting, deblurring, and shift restoration. Code and models are available +at https://github.com/zh460045050/V2L-Tokenizer.",cs.CV,['cs.CV'] +Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,Philipp Schröppel · Christopher Wewer · Jan Lenssen · Eddy Ilg · Thomas Brox,https://neural-point-cloud-diffusion.github.io/,https://arxiv.org/abs/2312.14124,,2312.14124.pdf,Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,"Controllable generation of 3D assets is important for many practical +applications like content creation in movies, games and engineering, as well as +in AR/VR. Recently, diffusion models have shown remarkable results in +generation quality of 3D objects. However, none of the existing models enable +disentangled generation to control the shape and appearance separately. For the +first time, we present a suitable representation for 3D diffusion models to +enable such disentanglement by introducing a hybrid point cloud and neural +radiance field approach. We model a diffusion process over point positions +jointly with a high-dimensional feature space for a local density and radiance +decoder. While the point positions represent the coarse shape of the object, +the point features allow modeling the geometry and appearance details. This +disentanglement enables us to sample both independently and therefore to +control both separately. Our approach sets a new state of the art in generation +compared to previous disentanglement-capable methods by reduced FID scores of +30-90% and is on-par with other non disentanglement-capable state-of-the art +methods.",cs.CV,['cs.CV'] +SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos,Tao Wu · Runyu He · Gangshan Wu · Limin Wang,https://github.com/MCG-NJU/SportsHHI,https://arxiv.org/abs/2404.04565,,2404.04565.pdf,SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos,"Video-based visual relation detection tasks, such as video scene graph +generation, play important roles in fine-grained video understanding. 
However, +current video visual relation detection datasets have two main limitations that +hinder the progress of research in this area. First, they do not explore +complex human-human interactions in multi-person scenarios. Second, the +relation types of existing datasets have relatively low-level semantics and can +be often recognized by appearance or simple prior information, without the need +for detailed spatio-temporal context reasoning. Nevertheless, comprehending +high-level interactions between humans is crucial for understanding complex +multi-person videos, such as sports and surveillance videos. To address this +issue, we propose a new video visual relation detection task: video human-human +interaction detection, and build a dataset named SportsHHI for it. SportsHHI +contains 34 high-level interaction classes from basketball and volleyball +sports. 118,075 human bounding boxes and 50,649 interaction instances are +annotated on 11,398 keyframes. To benchmark this, we propose a two-stage +baseline method and conduct extensive experiments to reveal the key factors for +a successful human-human interaction detector. We hope that SportsHHI can +stimulate research on human interaction understanding in videos and promote the +development of spatio-temporal context modeling techniques in video visual +relation detection.",cs.CV,['cs.CV'] +Time-Efficient Light-Field Acquisition Using Coded Aperture and Events,Shuji Habuchi · Keita Takahashi · Chihiro Tsutake · Toshiaki Fujii · Hajime Nagahara,https://www.fujii.nuee.nagoya-u.ac.jp/Research/EventLF/,https://arxiv.org/abs/2403.07244,,2403.07244.pdf,Time-Efficient Light-Field Acquisition Using Coded Aperture and Events,"We propose a computational imaging method for time-efficient light-field +acquisition that combines a coded aperture with an event-based camera. +Different from the conventional coded-aperture imaging method, our method +applies a sequence of coding patterns during a single exposure for an image +frame. The parallax information, which is related to the differences in coding +patterns, is recorded as events. The image frame and events, all of which are +measured in a single exposure, are jointly used to computationally reconstruct +a light field. We also designed an algorithm pipeline for our method that is +end-to-end trainable on the basis of deep optics and compatible with real +camera hardware. We experimentally showed that our method can achieve more +accurate reconstruction than several other imaging methods with a single +exposure. We also developed a hardware prototype with the potential to complete +the measurement on the camera within 22 msec and demonstrated that light fields +from real 3-D scenes can be obtained with convincing visual quality. Our +software and supplementary video are available from our project website.",cs.CV,"['cs.CV', 'eess.IV']" +Rapid Motor Adaptation for Robotic Manipulator Arms,Yichao Liang · Kevin Ellis · João F. Henriques, ,https://arxiv.org/abs/2312.04670v1,,2312.04670v1.pdf,Rapid Motor Adaptation for Robotic Manipulator Arms,"Developing generalizable manipulation skills is a core challenge in embodied +AI. This includes generalization across diverse task configurations, +encompassing variations in object shape, density, friction coefficient, and +external disturbances such as forces applied to the robot. Rapid Motor +Adaptation (RMA) offers a promising solution to this challenge. 
It posits that +essential hidden variables influencing an agent's task performance, such as +object mass and shape, can be effectively inferred from the agent's action and +proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand +rotation, we use depth perception to develop agents tailored for rapid motor +adaptation in a variety of manipulation tasks. We evaluated our agents on four +challenging tasks from the Maniskill2 benchmark, namely pick-and-place +operations with hundreds of objects from the YCB and EGAD datasets, peg +insertion with precise position and orientation, and operating a variety of +faucets and handles, with customized environment variations. Empirical results +demonstrate that our agents surpass state-of-the-art methods like automatic +domain randomization and vision-based policies, obtaining better generalization +performance and sample efficiency.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation,Jin Wang · Bingfeng Zhang · Jian Pang · Honglong Chen · Weifeng Liu, ,https://arxiv.org/abs/2405.08458,,2405.08458.pdf,Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation,"Few-shot segmentation remains challenging due to the limitations of its +labeling information for unseen classes. Most previous approaches rely on +extracting high-level feature maps from the frozen visual encoder to compute +the pixel-wise similarity as a key prior guidance for the decoder. However, +such a prior representation suffers from coarse granularity and poor +generalization to new classes since these high-level feature maps have obvious +category bias. In this work, we propose to replace the visual prior +representation with the visual-text alignment capacity to capture more reliable +guidance and enhance the model generalization. Specifically, we design two +kinds of training-free prior information generation strategy that attempts to +utilize the semantic alignment capability of the Contrastive Language-Image +Pre-training model (CLIP) to locate the target class. Besides, to acquire more +accurate prior guidance, we build a high-order relationship of attention maps +and utilize it to refine the initial prior information. Experiments on both the +PASCAL-5{i} and COCO-20{i} datasets show that our method obtains a clearly +substantial improvement and reaches the new state-of-the-art performance.",cs.CV,['cs.CV'] +A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning,Yuelin Zhang · Pengyu Zheng · Wanquan Yan · Chengyu Fang · Shing Shin Cheng, ,https://arxiv.org/abs/2403.02611,,2403.02611.pdf,A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning,"Defocus blur is a persistent problem in microscope imaging that poses harm to +pathology interpretation and medical intervention in cell microscopy and +microscope surgery. To address this problem, a unified framework including the +multi-pyramid transformer (MPT) and extended frequency contrastive +regularization (EFCR) is proposed to tackle two outstanding challenges in +microscopy deblur: longer attention span and data deficiency. The MPT employs +an explicit pyramid structure at each network stage that integrates the +cross-scale window attention (CSWA), the intra-scale channel attention (ISCA), +and the feature-enhancing feed-forward network (FEFN) to capture long-range +cross-scale spatial interaction and global channel context. 
The EFCR addresses +the data deficiency problem by exploring latent deblur signals from different +frequency bands. It also enables deblur knowledge transfer to learn +cross-domain information from extra data, improving deblur performance for +labeled and unlabeled data. Extensive experiments and downstream task +validation show the framework achieves state-of-the-art performance across +multiple datasets. Project page: https://github.com/PieceZhang/MPT-CataBlur.",cs.CV,"['cs.CV', 'cs.AI']" +Rotation-Agnostic Image Representation Learning for Digital Pathology,Saghir Alfasly · Abubakr Shafique · Peyman Nejat · Jibran Khan · Areej Alsaafin · Ghazal Alabtah · Hamid Tizhoosh,https://kimialabmayo.github.io/PathDino-Page/,https://arxiv.org/abs/2311.08359,,2311.08359.pdf,Rotation-Agnostic Image Representation Learning for Digital Pathology,"This paper addresses complex challenges in histopathological image analysis +through three key contributions. Firstly, it introduces a fast patch selection +method, FPS, for whole-slide image (WSI) analysis, significantly reducing +computational cost while maintaining accuracy. Secondly, it presents PathDino, +a lightweight histopathology feature extractor with a minimal configuration of +five Transformer blocks and only 9 million parameters, markedly fewer than +alternatives. Thirdly, it introduces a rotation-agnostic representation +learning paradigm using self-supervised learning, effectively mitigating +overfitting. We also show that our compact model outperforms existing +state-of-the-art histopathology-specific vision transformers on 12 diverse +datasets, including both internal datasets spanning four sites (breast, liver, +skin, and colorectal) and seven public datasets (PANDA, CAMELYON16, BRACS, +DigestPath, Kather, PanNuke, and WSSS4LUAD). Notably, even with a training +dataset of 6 million histopathology patches from The Cancer Genome Atlas +(TCGA), our approach demonstrates an average 8.5% improvement in patch-level +majority vote performance. These contributions provide a robust framework for +enhancing image analysis in digital pathology, rigorously validated through +extensive evaluation. Project Page: +https://kimialabmayo.github.io/PathDino-Page/",cs.CV,['cs.CV'] +Weakly Misalignment-free Adaptive Feature Alignment for UAVs-based Multimodal Object Detection,Chen Chen · Jiahao Qi · Xingyue Liu · Kangcheng Bin · Ruigang Fu · Xikun Hu · Ping Zhong, ,https://arxiv.org/abs/2405.16873,,2405.16873.pdf,ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection,"In the field of 3D object detection tasks, fusing heterogeneous features from +LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is +a widely adopted paradigm. However, existing methods are often compromised by +imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera +BEV fusion. Moreover, such inaccuracies result in errors in depth estimation +for the camera branch, ultimately causing misalignment between LiDAR and camera +BEV features. In this work, we propose a novel ContrastAlign approach that +utilizes contrastive learning to enhance the alignment of heterogeneous +modalities, thereby improving the robustness of the fusion process. +Specifically, our approach includes the L-Instance module, which directly +outputs LiDAR instance features within LiDAR BEV features. 
Then, we introduce +the C-Instance module, which predicts camera instance features through RoI +(Region of Interest) pooling on the camera BEV features. We propose the +InstanceFusion module, which utilizes contrastive learning to generate similar +instance features across heterogeneous modalities. We then use graph matching +to calculate the similarity between the neighboring camera instance features +and the similarity instance features to complete the alignment of instance +features. Our method achieves state-of-the-art performance, with an mAP of +70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set. +Importantly, our method outperforms BEVFusion by 7.3% under conditions with +misalignment noise.",cs.CV,['cs.CV'] +Learning with Structural Labels for Learning with Noisy Labels,Noo-ri Kim · Jin-Seop Lee · Jee-Hyong Lee, ,https://arxiv.org/abs/2401.04390,,2401.04390.pdf,Learning with Noisy Labels: Interconnection of Two Expectation-Maximizations,"Labor-intensive labeling becomes a bottleneck in developing computer vision +algorithms based on deep learning. For this reason, dealing with imperfect +labels has increasingly gained attention and has become an active field of +study. We address learning with noisy labels (LNL) problem, which is formalized +as a task of finding a structured manifold in the midst of noisy data. In this +framework, we provide a proper objective function and an optimization algorithm +based on two expectation-maximization (EM) cycles. The separate networks +associated with the two EM cycles collaborate to optimize the objective +function, where one model is for distinguishing clean labels from corrupted +ones while the other is for refurbishing the corrupted labels. This approach +results in a non-collapsing LNL-flywheel model in the end. Experiments show +that our algorithm achieves state-of-the-art performance in multiple standard +benchmarks with substantial margins under various types of label noise.",cs.CV,['cs.CV'] +Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training,Runze He · Shaofei Huang · Xuecheng Nie · Tianrui Hui · Luoqi Liu · Jiao Dai · Jizhong Han · Guanbin Li · Si Liu,https://customnerf.github.io/,https://arxiv.org/abs/2312.01663,,2312.01663.pdf,Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training,"In this paper, we target the adaptive source driven 3D scene editing task by +proposing a CustomNeRF model that unifies a text description or a reference +image as the editing prompt. However, obtaining desired editing results +conformed with the editing prompt is nontrivial since there exist two +significant challenges, including accurate editing of only foreground regions +and multi-view consistency given a single-view reference image. To tackle the +first challenge, we propose a Local-Global Iterative Editing (LGIE) training +scheme that alternates between foreground region editing and full-image +editing, aimed at foreground-only manipulation while preserving the background. +For the second challenge, we also design a class-guided regularization that +exploits class priors within the generation model to alleviate the +inconsistency problem among different views in image-driven editing. 
Extensive +experiments show that our CustomNeRF produces precise editing results under +various real scenes for both text- and image-driven settings.",cs.CV,"['cs.CV', 'cs.AI']" +Training-free Pretrained Model Merging,Zhengqi Xu · Ke Yuan · Huiqiong Wang · Yong Wang · Mingli Song · Jie Song,https://github.com/zju-vipa/training_free_model_merging,https://arxiv.org/abs/2403.01753,,2403.01753.pdf,Training-Free Pretrained Model Merging,"Recently, model merging techniques have surfaced as a solution to combine +multiple single-talent models into a single multi-talent model. However, +previous endeavors in this field have either necessitated additional training +or fine-tuning processes, or require that the models possess the same +pre-trained initialization. In this work, we identify a common drawback in +prior works w.r.t. the inconsistency of unit similarity in the weight space and +the activation space. To address this inconsistency, we propose an innovative +model merging framework, coined as merging under dual-space constraints +(MuDSC). Specifically, instead of solely maximizing the objective of a single +space, we advocate for the exploration of permutation matrices situated in a +region with a unified high similarity in the dual space, achieved through the +linear combination of activation and weight similarity matrices. In order to +enhance usability, we have also incorporated adaptations for group structure, +including Multi-Head Attention and Group Normalization. Comprehensive +experimental comparisons demonstrate that MuDSC can significantly boost the +performance of merged models with various task combinations and architectures. +Furthermore, the visualization of the merged model within the multi-task loss +landscape reveals that MuDSC enables the merged model to reside in the +overlapping segment, featuring a unified lower loss for each task. Our code is +publicly available at https://github.com/zju-vipa/training_free_model_merging.",cs.CV,['cs.CV'] +SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model,Zhengang Li · Yan Kang · Yuchen Liu · Difan Liu · Tobias Hinz · Feng Liu · Yanzhi Wang, ,https://ar5iv.labs.arxiv.org/html/2211.11018,,2211.11018.pdf,MagicVideo: Efficient Video Generation With Latent Diffusion Models,"We present an efficient text-to-video generation framework based on latent +diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips +that are concordant with the given text descriptions. Due to a novel and +efficient 3D U-Net design and modeling video distributions in a low-dimensional +space, MagicVideo can synthesize video clips with 256x256 spatial resolution on +a single GPU card, which takes around 64x fewer computations than the Video +Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works +that directly train video models in the RGB space, we use a pre-trained VAE to +map video clips into a low-dimensional latent space and learn the distribution +of videos' latent codes via a diffusion model. Besides, we introduce two new +designs to adapt the U-Net denoiser trained on image tasks to video data: a +frame-wise lightweight adaptor for the image-to-video distribution adjustment +and a directed temporal attention module to capture temporal dependencies +across frames. Thus, we can exploit the informative weights of convolution +operators from a text-to-image model for accelerating video training. 
To +ameliorate the pixel dithering in the generated videos, we also propose a novel +VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive +experiments and demonstrate that MagicVideo can generate high-quality video +clips with either realistic or imaginary content. Refer to +\url{https://magicvideo.github.io/#} for more examples.",cs.CV,['cs.CV'] +GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects,Sungphill Moon · Hyeontae Son · Dongcheol Hur · Sangwook Kim, ,https://arxiv.org/abs/2403.11510,,2403.11510.pdf,GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects,"Despite the progress of learning-based methods for 6D object pose estimation, +the trade-off between accuracy and scalability for novel objects still exists. +Specifically, previous methods for novel objects do not make good use of the +target object's 3D shape information since they focus on generalization by +processing the shape indirectly, making them less effective. We present +GenFlow, an approach that enables both accuracy and generalization to novel +objects with the guidance of the target object's shape. Our method predicts +optical flow between the rendered image and the observed image and refines the +6D pose iteratively. It boosts the performance by a constraint of the 3D shape +and the generalizable geometric knowledge learned from an end-to-end +differentiable system. We further improve our model by designing a cascade +network architecture to exploit the multi-scale correlations and coarse-to-fine +refinement. GenFlow ranked first on the unseen object pose estimation +benchmarks in both the RGB and RGB-D cases. It also achieves performance +competitive with existing state-of-the-art methods for the seen object pose +estimation without any fine-tuning.",cs.CV,['cs.CV'] +Day-Night Cross-domain Vehicle Re-identification,Hongchao Li · Jingong Chen · AIHUA ZHENG · Yong Wu · YongLong Luo, ,,https://www.mdpi.com/2079-9292/13/10/1823,,,,,nan +Making Visual Sense of Oracle Bones for You and Me,Runqi Qiao · LAN YANG · Kaiyue Pang · Honggang Zhang, ,https://arxiv.org/abs/2311.15421,,2311.15421.pdf,Wired Perspectives: Multi-View Wire Art Embraces Generative AI,"Creating multi-view wire art (MVWA), a static 3D sculpture with diverse +interpretations from different viewpoints, is a complex task even for skilled +artists. In response, we present DreamWire, an AI system enabling everyone to +craft MVWA easily. Users express their vision through text prompts or +scribbles, freeing them from intricate 3D wire organisation. Our approach +synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation +from diffusion models or their variants (e.g., ControlNet). This blend enables +the system to represent 3D wire art, ensuring spatial continuity and overcoming +data scarcity. 
Extensive evaluation and analysis are conducted to shed insight +on the inner workings of the proposed system, including the trade-off between +connectivity and visual aesthetics.",cs.CV,"['cs.CV', 'cs.AI']" +EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI,Tai Wang · Xiaohan Mao · Chenming Zhu · Runsen Xu · Ruiyuan Lyu · Peisen Li · Xiao Chen · Wenwei Zhang · Kai Chen · Tianfan Xue · Xihui Liu · Cewu Lu · Dahua Lin · Jiangmiao Pang, ,https://arxiv.org/abs/2312.16170v1,,2312.16170v1.pdf,EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI,"In the realm of computer vision and robotics, embodied agents are expected to +explore their environment and carry out human instructions. This necessitates +the ability to fully understand 3D scenes given their first-person observations +and contextualize them into language for interaction. However, traditional +research focuses more on scene-level input and output setups from a global +view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric +3D perception dataset and benchmark for holistic 3D scene understanding. It +encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language +prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which +partially align with LVIS, and dense semantic occupancy with 80 common +categories. Building upon this database, we introduce a baseline framework +named Embodied Perceptron. It is capable of processing an arbitrary number of +multi-modal inputs and demonstrates remarkable 3D perception capabilities, both +within the two series of benchmarks we set up, i.e., fundamental 3D perception +tasks and language-grounded tasks, and in the wild. Codes, datasets, and +benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios,HyunJun Jung · Shun-Cheng Wu · Patrick Ruhkamp · Guangyao Zhai · Hannah Schieber · Giulia Rizzoli · Pengyuan Wang · Hongcheng Zhao · Lorenzo Garattoni · Sven Meier · Daniel Roth · Nassir Navab · Benjamin Busam,https://sites.google.com/view/housecat6d,https://ar5iv.labs.arxiv.org/html/2308.10627,,2308.10627.pdf,Polarimetric Information for Multi-Modal 6D Pose Estimation of Photometrically Challenging Objects with Limited Data,"6D pose estimation pipelines that rely on RGB-only or RGB-D data show +limitations for photometrically challenging objects with e.g. textureless +surfaces, reflections or transparency. A supervised learning-based method +utilising complementary polarisation information as input modality is proposed +to overcome such limitations. This supervised approach is then extended to a +self-supervised paradigm by leveraging physical characteristics of polarised +light, thus eliminating the need for annotated real data. 
The methods achieve +significant advancements in pose estimation by leveraging geometric information +from polarised light and incorporating shape priors and invertible physical +constraints.",cs.CV,['cs.CV'] +SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image,Yunhao Li · Xiaodong Wang · Ping Wang · Xin Yuan · Peidong Liu, ,https://arxiv.org/abs/2403.20018,,2403.20018.pdf,SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image,"In this paper, we explore the potential of Snapshot Compressive Imaging (SCI) +technique for recovering the underlying 3D scene representation from a single +temporal compressed image. SCI is a cost-effective method that enables the +recording of high-dimensional data, such as hyperspectral or temporal +information, into a single image using low-cost 2D imaging sensors. To achieve +this, a series of specially designed 2D masks are usually employed, which not +only reduces storage requirements but also offers potential privacy protection. +Inspired by this, to take one step further, our approach builds upon the +powerful 3D scene representation capabilities of neural radiance fields (NeRF). +Specifically, we formulate the physical imaging process of SCI as part of the +training of NeRF, allowing us to exploit its impressive performance in +capturing complex scene structures. To assess the effectiveness of our method, +we conduct extensive evaluations using both synthetic data and real data +captured by our SCI system. Extensive experimental results demonstrate that our +proposed approach surpasses the state-of-the-art methods in terms of image +reconstruction and novel view image synthesis. Moreover, our method also +exhibits the ability to restore high frame-rate multi-view consistent images by +leveraging SCI and the rendering capabilities of NeRF. The code is available at +https://github.com/WU-CVGL/SCINeRF.",eess.IV,"['eess.IV', 'cs.CV']" +Source-Free Domain Adaptation with Frozen Multimodal Foundation Model,Song Tang · Wenxin Su · Mao Ye · Xiatian Zhu,https://www.taulab.cc/proj/sfda/cvpr24/difo/index.html,https://arxiv.org/abs/2311.16510,,2311.16510.pdf,Source-Free Domain Adaptation with Frozen Multimodal Foundation Model,"Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a +target domain, with only access to unlabeled target training data and the +source model pre-trained on a supervised source domain. Relying on pseudo +labeling and/or auxiliary supervision, conventional methods are inevitably +error-prone. To mitigate this limitation, in this work we for the first time +explore the potentials of off-the-shelf vision-language (ViL) multimodal models +(e.g.,CLIP) with rich whilst heterogeneous knowledge. We find that directly +applying the ViL model to the target domain in a zero-shot fashion is +unsatisfactory, as it is not specialized for this particular task but largely +generic. To make it task specific, we propose a novel Distilling multimodal +Foundation model(DIFO)approach. Specifically, DIFO alternates between two steps +during adaptation: (i) Customizing the ViL model by maximizing the mutual +information with the target model in a prompt learning manner, (ii) Distilling +the knowledge of this customized ViL model to the target model. For more +fine-grained and reliable distillation, we further introduce two effective +regularization terms, namely most-likely category encouragement and predictive +consistency. Extensive experiments show that DIFO significantly outperforms the +state-of-the-art alternatives. 
Code is here",cs.CV,['cs.CV'] +InstructVideo: Instructing Video Diffusion Models with Human Feedback,Hangjie Yuan · Shiwei Zhang · Xiang Wang · Yujie Wei · Tao Feng · Yining Pan · Yingya Zhang · Ziwei Liu · Samuel Albanie · Dong Ni, ,https://arxiv.org/abs/2312.12490,,2312.12490.pdf,InstructVideo: Instructing Video Diffusion Models with Human Feedback,"Diffusion models have emerged as the de facto paradigm for video generation. +However, their reliance on web-scale data of varied quality often yields +results that are visually unappealing and misaligned with the textual prompts. +To tackle this problem, we propose InstructVideo to instruct text-to-video +diffusion models with human feedback by reward fine-tuning. InstructVideo has +two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by +generating through the full DDIM sampling chain, we recast reward fine-tuning +as editing. By leveraging the diffusion process to corrupt a sampled video, +InstructVideo requires only partial inference of the DDIM sampling chain, +reducing fine-tuning cost while improving fine-tuning efficiency. 2) To +mitigate the absence of a dedicated video reward model for human preferences, +we repurpose established image reward models, e.g., HPSv2. To this end, we +propose Segmental Video Reward, a mechanism to provide reward signals based on +segmental sparse sampling, and Temporally Attenuated Reward, a method that +mitigates temporal modeling degradation during fine-tuning. Extensive +experiments, both qualitative and quantitative, validate the practicality and +efficacy of using image reward models in InstructVideo, significantly enhancing +the visual quality of generated videos without compromising generalization +capabilities. Code and models will be made publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation.,Dong Zhao · Shuang Wang · Qi Zang · Licheng Jiao · Nicu Sebe · Zhun Zhong, ,,,,,,,nan +FREE: Faster and Better Data-Free Meta-Learning,Yongxian Wei · Zixuan Hu · Zhenyi Wang · Li Shen · Chun Yuan · Dacheng Tao, ,https://arxiv.org/abs/2405.00984,,2405.00984.pdf,FREE: Faster and Better Data-Free Meta-Learning,"Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of +pre-trained models without requiring the original data, presenting practical +benefits in contexts constrained by data privacy concerns. Current DFML methods +primarily focus on the data recovery from these pre-trained models. However, +they suffer from slow recovery speed and overlook gaps inherent in +heterogeneous pre-trained models. In response to these challenges, we introduce +the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: +(i) a meta-generator for rapidly recovering training tasks from pre-trained +models; and (ii) a meta-learner for generalizing to new unseen tasks. +Specifically, within the module Faster Inversion via Meta-Generator, each +pre-trained model is perceived as a distinct task. The meta-generator can +rapidly adapt to a specific task in just five steps, significantly accelerating +the data recovery. Furthermore, we propose Better Generalization via +Meta-Learner and introduce an implicit gradient alignment algorithm to optimize +the meta-learner. This is achieved as aligned gradient directions alleviate +potential conflicts among tasks from heterogeneous pre-trained models. 
+Empirical experiments on multiple benchmarks affirm the superiority of our +approach, marking a notable speed-up (20$\times$) and performance enhancement +(1.42\% $\sim$ 4.78\%) in comparison to the state-of-the-art.",cs.LG,"['cs.LG', 'cs.CV']" +HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation,Ce Zhang · Simon Stepputtis · Joseph Campbell · Katia Sycara · Yaqi Xie,https://zhangce01.github.io/HiKER-SGG/,https://arxiv.org/abs/2403.12033,,2403.12033.pdf,HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation,"Being able to understand visual scenes is a precursor for many downstream +tasks, including autonomous driving, robotics, and other vision-based +approaches. A common approach enabling the ability to reason over visual data +is Scene Graph Generation (SGG); however, many existing approaches assume +undisturbed vision, i.e., the absence of real-world corruptions such as fog, +snow, smoke, as well as non-uniform perturbations like sun glare or water +drops. In this work, we propose a novel SGG benchmark containing procedurally +generated weather corruptions and other transformations over the Visual Genome +dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge +Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline +for scene graph generation under such challenging setting. At its core, +HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its +predictions from coarse initial estimates to detailed predictions. In our +extensive experiments, we show that HiKER-SGG does not only demonstrate +superior performance on corrupted images in a zero-shot manner, but also +outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is +available at https://github.com/zhangce01/HiKER-SGG.",cs.CV,['cs.CV'] +Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,Xin Li · Yunfei Wu · Xinghua Jiang · ZhiHao Guo · Mingming Gong · Haoyu Cao · Yinsong Liu · Deqiang Jiang · Xing Sun, ,https://arxiv.org/abs/2402.19014,,2402.19014.pdf,Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,"Recently, the advent of Large Visual-Language Models (LVLMs) has received +increasing attention across various domains, particularly in the field of +visual document understanding (VDU). Different from conventional +vision-language tasks, VDU is specifically concerned with text-rich scenarios +containing abundant document elements. Nevertheless, the importance of +fine-grained features remains largely unexplored within the community of LVLMs, +leading to suboptimal performance in text-rich scenarios. In this paper, we +abbreviate it as the fine-grained feature collapse issue. With the aim of +filling this gap, we propose a contrastive learning framework, termed Document +Object COntrastive learning (DoCo), specifically tailored for the downstream +tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the +features of document objects and align them to the visual features generated by +the vision encoder of LVLM, which enhances visual representation in text-rich +scenarios. It can represent that the contrastive learning between the visual +holistic representations and the multimodal fine-grained features of document +objects can assist the vision encoder in acquiring more effective visual cues, +thereby enhancing the comprehension of text-rich documents in LVLMs. 
We also +demonstrate that the proposed DoCo serves as a plug-and-play pre-training +method, which can be employed in the pre-training of various LVLMs without +inducing any increase in computational complexity during the inference process. +Extensive experimental results on multiple benchmarks of VDU reveal that LVLMs +equipped with our proposed DoCo can achieve superior performance and mitigate +the gap between VDU and generic vision-language tasks.",cs.CV,['cs.CV'] +PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF,Yutao Feng · Yintong Shang · Xuan Li · Tianjia Shao · Chenfanfu Jiang · Yin Yang,https://fytalon.github.io/pienerf/,https://arxiv.org/abs/2311.13099,,2311.13099.pdf,PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF,"We show that physics-based simulations can be seamlessly integrated with NeRF +to generate high-quality elastodynamics of real-world objects. Unlike existing +methods, we discretize nonlinear hyperelasticity in a meshless way, obviating +the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh +or voxel grid. A quadratic generalized moving least square (Q-GMLS) is employed +to capture nonlinear dynamics and large deformation on the implicit model. Such +meshless integration enables versatile simulations of complex and codimensional +shapes. We adaptively place the least-square kernels according to the NeRF +density field to significantly reduce the complexity of the nonlinear +simulation. As a result, physically realistic animations can be conveniently +synthesized using our method for a wide range of hyperelastic materials at an +interactive rate. For more information, please visit our project page at +https://fytalon.github.io/pienerf/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling,Linqi Zhou · Andy Shih · Chenlin Meng · Stefano Ermon, ,https://arxiv.org/abs/2311.17082,,2311.17082.pdf,DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling,"Recent methods such as Score Distillation Sampling (SDS) and Variational +Score Distillation (VSD) using 2D diffusion models for text-to-3D generation +have demonstrated impressive generation quality. However, the long generation +time of such algorithms significantly degrades the user experience. To tackle +this problem, we propose DreamPropeller, a drop-in acceleration algorithm that +can be wrapped around any existing text-to-3D generation pipeline based on +score distillation. Our framework generalizes Picard iterations, a classical +algorithm for parallel sampling an ODE path, and can account for non-ODE paths +such as momentum-based gradient updates and changes in dimensions during the +optimization process as in many cases of 3D generation. We show that our +algorithm trades parallel compute for wallclock time and empirically achieves +up to 4.7x speedup with a negligible drop in generation quality for all tested +frameworks.",cs.CV,"['cs.CV', 'stat.ML']" +RepViT: Revisiting Mobile CNN From ViT Perspective,Ao Wang · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding,https://github.com/THU-MIG/RepViT,https://arxiv.org/abs/2307.09283,,2307.09283.pdf,RepViT: Revisiting Mobile CNN From ViT Perspective,"Recently, lightweight Vision Transformers (ViTs) demonstrate superior +performance and lower latency, compared with lightweight Convolutional Neural +Networks (CNNs), on resource-constrained mobile devices. 
Researchers have +discovered many structural connections between lightweight ViTs and lightweight +CNNs. However, the notable architectural disparities in the block structure, +macro, and micro designs between them have not been adequately examined. In +this study, we revisit the efficient design of lightweight CNNs from ViT +perspective and emphasize their promising prospect for mobile devices. +Specifically, we incrementally enhance the mobile-friendliness of a standard +lightweight CNN, \ie, MobileNetV3, by integrating the efficient architectural +designs of lightweight ViTs. This ends up with a new family of pure lightweight +CNNs, namely RepViT. Extensive experiments show that RepViT outperforms +existing state-of-the-art lightweight ViTs and exhibits favorable latency in +various vision tasks. Notably, on ImageNet, RepViT achieves over 80\% top-1 +accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a +lightweight model, to the best of our knowledge. Besides, when RepViT meets +SAM, our RepViT-SAM can achieve nearly 10$\times$ faster inference than the +advanced MobileSAM. Codes and models are available at +\url{https://github.com/THU-MIG/RepViT}.",cs.CV,['cs.CV'] +Neural Video Compression with Feature Modulation,Jiahao Li · Bin Li · Yan Lu, ,https://arxiv.org/abs/2402.17414v1,,2402.17414v1.pdf,Neural Video Compression with Feature Modulation,"The emerging conditional coding-based neural video codec (NVC) shows +superiority over commonly-used residual coding-based codec and the latest NVC +already claims to outperform the best traditional codec. However, there still +exist critical problems blocking the practicality of NVC. In this paper, we +propose a powerful conditional coding-based NVC that solves two critical +problems via feature modulation. The first is how to support a wide quality +range in a single model. Previous NVC with this capability only supports about +3.8 dB PSNR range on average. To tackle this limitation, we modulate the latent +feature of the current frame via the learnable quantization scaler. During the +training, we specially design the uniform quantization parameter sampling +mechanism to improve the harmonization of encoding and quantization. This +results in a better learning of the quantization scaler and helps our NVC +support about 11.4 dB PSNR range. The second is how to make NVC still work +under a long prediction chain. We expose that the previous SOTA NVC has an +obvious quality degradation problem when using a large intra-period setting. To +this end, we propose modulating the temporal feature with a periodically +refreshing mechanism to boost the quality. %Besides solving the above two +problems, we also design a single model that can support both RGB and YUV +colorspaces. Notably, under single intra-frame setting, our codec can achieve +29.7\% bitrate saving over previous SOTA NVC with 16\% MACs reduction. Our +codec serves as a notable landmark in the journey of NVC evolution. The codes +are at https://github.com/microsoft/DCVC.",cs.CV,"['cs.CV', 'eess.IV']" +Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,Junjie Chen · Jiebin Yan · Yuming Fang · Li Niu, ,https://arxiv.org/abs/2403.13647,,2403.13647.pdf,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. 
Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,Insoo Kim · Jae Seok Choi · Geonseok Seo · Kinam Kwon · Jinwoo Shin · Hyong-Euk Lee, ,https://arxiv.org/abs/2404.12168,,2404.12168.pdf,Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,"As recent advances in mobile camera technology have enabled the capability to +capture high-resolution images, such as 4K images, the demand for an efficient +deblurring model handling large motion has increased. In this paper, we +discover that the image residual errors, i.e., blur-sharp pixel differences, +can be grouped into some categories according to their motion blur type and how +complex their neighboring pixels are. Inspired by this, we decompose the +deblurring (regression) task into blur pixel discretization (pixel-level blur +classification) and discrete-to-continuous conversion (regression with blur +class map) tasks. Specifically, we generate the discretized image residual +errors by identifying the blur pixels and then transform them to a continuous +form, which is computationally more efficient than naively solving the original +regression problem with continuous values. Here, we found that the +discretization result, i.e., blur segmentation map, remarkably exhibits visual +similarity with the image residual errors. As a result, our efficient model +shows comparable performance to state-of-the-art methods in realistic +benchmarks, while our method is up to 10 times computationally more efficient.",cs.CV,"['cs.CV', 'cs.AI']" +Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention,Xingyu Zhou · Leheng Zhang · Xiaorui Zhao · Keze Wang · Leida Li · Shuhang Gu, ,https://arxiv.org/abs/2401.06312,,2401.06312.pdf,Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention,"Recently, Vision Transformer has achieved great success in recovering missing +details in low-resolution sequences, i.e., the video super-resolution (VSR) +task. Despite its superiority in VSR accuracy, the heavy computational burden +as well as the large memory footprint hinder the deployment of +Transformer-based VSR models on constrained devices. 
In this paper, we address +the above issue by proposing a novel feature-level masked processing framework: +VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR +is leveraging feature-level temporal continuity between adjacent frames to +reduce redundant computations and make more rational use of previously enhanced +SR features. Concretely, we propose an intra-frame and inter-frame attention +block which takes the respective roles of past features and input features into +consideration and only exploits previously enhanced features to provide +supplementary information. In addition, an adaptive block-wise mask prediction +module is developed to skip unimportant computations according to feature +similarity between adjacent frames. We conduct detailed ablation studies to +validate our contributions and compare the proposed method with recent +state-of-the-art VSR approaches. The experimental results demonstrate that +MIA-VSR improves the memory and computation efficiency over state-of-the-art +methods, without trading off PSNR accuracy. The code is available at +https://github.com/LabShuHangGU/MIA-VSR.",cs.CV,['cs.CV'] +Any-Shift Prompting for Generalization over Distributions,Zehao Xiao · Jiayi Shen · Mohammad Mahdi Derakhshani · Shengcai Liao · Cees G. M. Snoek, ,https://arxiv.org/abs/2402.10099,,2402.10099.pdf,Any-Shift Prompting for Generalization over Distributions,"Image-language models with prompt learning have shown remarkable advances in +numerous downstream vision tasks. Nevertheless, conventional prompt learning +methods overfit their training distribution and lose the generalization ability +on test distributions. To improve generalization across various distribution +shifts, we propose any-shift prompting: a general probabilistic inference +framework that considers the relationship between training and test +distributions during prompt learning. We explicitly connect training and test +distributions in the latent space by constructing training and test prompts in +a hierarchical architecture. Within this framework, the test prompt exploits +the distribution relationships to guide the generalization of the CLIP +image-language model from training to any test distribution. To effectively +encode the distribution information and their relationships, we further +introduce a transformer inference network with a pseudo-shift training +mechanism. The network generates the tailored test prompt with both training +and test information in a feedforward pass, avoiding extra training costs at +test time. Extensive experiments on twenty-three datasets demonstrate the +effectiveness of any-shift prompting on the generalization over various +distribution shifts.",cs.CV,['cs.CV'] +Mosaic-SDF for 3D Generative Models,Lior Yariv · Omri Puny · Oran Gafni · Yaron Lipman,https://lioryariv.github.io/msdf/,https://arxiv.org/abs/2312.09222,,2312.09222.pdf,Mosaic-SDF for 3D Generative Models,"Current diffusion or flow-based generative models for 3D shapes divide to +two: distilling pre-trained 2D image diffusion models, and training directly on +3D shapes. When training a diffusion or flow models on 3D shapes a crucial +design choice is the shape representation. 
An effective shape representation +needs to adhere three design principles: it should allow an efficient +conversion of large 3D datasets to the representation form; it should provide a +good tradeoff of approximation power versus number of parameters; and it should +have a simple tensorial form that is compatible with existing powerful neural +architectures. While standard 3D shape representations such as volumetric grids +and point clouds do not adhere to all these principles simultaneously, we +advocate in this paper a new representation that does. We introduce Mosaic-SDF +(M-SDF): a simple 3D shape representation that approximates the Signed Distance +Function (SDF) of a given shape by using a set of local grids spread near the +shape's boundary. The M-SDF representation is fast to compute for each shape +individually making it readily parallelizable; it is parameter efficient as it +only covers the space around the shape's boundary; and it has a simple matrix +form, compatible with Transformer-based architectures. We demonstrate the +efficacy of the M-SDF representation by using it to train a 3D generative flow +model including class-conditioned generation with the 3D Warehouse dataset, and +text-to-3D generation using a dataset of about 600k caption-shape pairs.",cs.CV,"['cs.CV', 'cs.GR']" +Fourier-basis functions to bridge augmentation gap: Rethinking frequency augmentation in image classification,Mei Vaish · Shunxin Wang · Nicola Strisciuglio,https://github.com/nis-research/afa-augment,https://arxiv.org/abs/2403.01944,,2403.01944.pdf,Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification,"Computer vision models normally witness degraded performance when deployed in +real-world scenarios, due to unexpected changes in inputs that were not +accounted for during training. Data augmentation is commonly used to address +this issue, as it aims to increase data variety and reduce the distribution gap +between training and test data. However, common visual augmentations might not +guarantee extensive robustness of computer vision models. In this paper, we +propose Auxiliary Fourier-basis Augmentation (AFA), a complementary technique +targeting augmentation in the frequency domain and filling the augmentation gap +left by visual augmentations. We demonstrate the utility of augmentation via +Fourier-basis additive noise in a straightforward and efficient adversarial +setting. Our results show that AFA benefits the robustness of models against +common corruptions, OOD generalization, and consistency of performance of +models against increasing perturbations, with negligible deficit to the +standard performance of models. It can be seamlessly integrated with other +augmentation techniques to further boost performance. Code and models can be +found at: https://github.com/nis-research/afa-augment",cs.CV,"['cs.CV', 'cs.LG']" +CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning,Hyuck Lee · Heeyoung Kim, ,https://arxiv.org/abs/2403.10391,,2403.10391.pdf,CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning,"Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a +class-imbalanced set face two cascading challenges: 1) Classifiers tend to be +biased towards majority classes, and 2) Biased pseudo-labels are used for +training. 
It is difficult to appropriately re-balance the classifiers in SSL +because the class distribution of an unlabeled set is often unknown and could +be mismatched with that of a labeled set. We propose a novel class-imbalanced +SSL algorithm called class-distribution-mismatch-aware debiasing (CDMAD). For +each iteration of training, CDMAD first assesses the classifier's biased degree +towards each class by calculating the logits on an image without any patterns +(e.g., solid color image), which can be considered irrelevant to the training +set. CDMAD then refines biased pseudo-labels of the base SSL algorithm by +ensuring the classifier's neutrality. CDMAD uses these refined pseudo-labels +during the training of the base SSL algorithm to improve the quality of the +representations. In the test phase, CDMAD similarly refines biased class +predictions on test samples. CDMAD can be seen as an extension of post-hoc +logit adjustment to address a challenge of incorporating the unknown class +distribution of the unlabeled set for re-balancing the biased classifier under +class distribution mismatch. CDMAD ensures Fisher consistency for the balanced +error. Extensive experiments verify the effectiveness of CDMAD.",cs.CV,['cs.CV'] +LoS: Local Structure Guided Stereo Matching,Kunhong Li · Longguang Wang · Ye Zhang · Kaiwen Xue · Shunbo Zhou · Yulan Guo, ,https://ar5iv.labs.arxiv.org/html/2309.16992,,2309.16992.pdf,Segment Anything Model is a Good Teacher for Local Feature Learning,"Local feature detection and description play an important role in many +computer vision tasks, which are designed to detect and describe keypoints in +""any scene"" and ""any downstream task"". Data-driven local feature learning +methods need to rely on pixel-level correspondence for training, which is +challenging to acquire at scale, thus hindering further improvements in +performance. In this paper, we propose SAMFeat to introduce SAM (segment +anything model), a fundamental model trained on 11 million images, as a teacher +to guide local feature learning and thus inspire higher performance on limited +datasets. To do so, first, we construct an auxiliary task of Pixel Semantic +Relational Distillation (PSRD), which distillates feature relations with +category-agnostic semantic information learned by the SAM encoder into a local +feature learning network, to improve local feature description using semantic +discrimination. Second, we develop a technique called Weakly Supervised +Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic +groupings derived from SAM as weakly supervised signals, to optimize the metric +space of local descriptors. Third, we design an Edge Attention Guidance (EAG) +to further improve the accuracy of local feature detection and description by +prompting the network to pay more attention to the edge region guided by SAM. +SAMFeat's performance on various tasks such as image matching on HPatches, and +long-term visual localization on Aachen Day-Night showcases its superiority +over previous local features. 
The release code is available at +https://github.com/vignywang/SAMFeat.",cs.CV,"['cs.CV', 'cs.LG']" +Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data,Xinting Liao · Weiming Liu · Chaochao Chen · Pengyang Zhou · Fengyuan Yu · Huabin Zhu · Binhui Yao · Tao Wang · Xiaolin Zheng · Yanchao Tan, ,https://arxiv.org/abs/2403.16398,,2403.16398.pdf,Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data,"Federated learning achieves effective performance in modeling decentralized +data. In practice, client data are not well-labeled, which makes it potential +for federated unsupervised learning (FUSL) with non-IID data. However, the +performance of existing FUSL methods suffers from insufficient representations, +i.e., (1) representation collapse entanglement among local and global models, +and (2) inconsistent representation spaces among local models. The former +indicates that representation collapse in local model will subsequently impact +the global model and other local models. The latter means that clients model +data representation with inconsistent parameters due to the deficiency of +supervision signals. In this work, we propose FedU2 which enhances generating +uniform and unified representation in FUSL with non-IID data. Specifically, +FedU2 consists of flexible uniform regularizer (FUR) and efficient unified +aggregator (EUA). FUR in each client avoids representation collapse via +dispersing samples uniformly, and EUA in server promotes unified representation +by constraining consistent client model updating. To extensively validate the +performance of FedU2, we conduct both cross-device and cross-silo evaluation +experiments on two benchmark datasets, i.e., CIFAR10 and CIFAR100.",cs.LG,"['cs.LG', 'cs.AI']" +Towards Detailed and Robust 3D Clothed Human Reconstruction with High-Frequency and Low-Frequency Information of Parametric Body Models,Yifan Yang · Dong Liu · Shuhai Zhang · Zeshuai Deng · Zixiong Huang · Mingkui Tan, ,https://arxiv.org/abs/2404.04876,,2404.04876.pdf,HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models,"Reconstructing 3D clothed human involves creating a detailed geometry of +individuals in clothing, with applications ranging from virtual try-on, movies, +to games. To enable practical and widespread applications, recent advances +propose to generate a clothed human from an RGB image. However, they struggle +to reconstruct detailed and robust avatars simultaneously. We empirically find +that the high-frequency (HF) and low-frequency (LF) information from a +parametric model has the potential to enhance geometry details and improve +robustness to noise, respectively. Based on this, we propose HiLo, namely +clothed human reconstruction with high- and low-frequency information, which +contains two components. 1) To recover detailed geometry using HF information, +we propose a progressive HF Signed Distance Function to enhance the detailed 3D +geometry of a clothed human. We analyze that our progressive learning manner +alleviates large gradients that hinder model convergence. 2) To achieve robust +reconstruction against inaccurate estimation of the parametric model by using +LF information, we propose a spatial interaction implicit function. This +function effectively exploits the complementary spatial information from a +low-resolution voxel grid of the parametric model. 
Experimental results +demonstrate that HiLo outperforms the state-of-the-art methods by 10.43% and +9.54% in terms of Chamfer distance on the Thuman2.0 and CAPE datasets, +respectively. Additionally, HiLo demonstrates robustness to noise from the +parametric model, challenging poses, and various clothing styles.",cs.CV,['cs.CV'] +MS-DETR: Efficient DETR Training with Mixed Supervision,Chuyang Zhao · Yifan Sun · Wenhao Wang · Qiang Chen · Errui Ding · Yi Yang · Jingdong Wang,https://github.com/Atten4Vis/MS-DETR,https://arxiv.org/abs/2401.03989,,2401.03989.pdf,MS-DETR: Efficient DETR Training with Mixed Supervision,"DETR accomplishes end-to-end object detection through iteratively generating +multiple object candidates based on image features and promoting one candidate +for each ground-truth object. The traditional training procedure using +one-to-one supervision in the original DETR lacks direct supervision for the +object detection candidates. + We aim at improving the DETR training efficiency by explicitly supervising +the candidate generation procedure through mixing one-to-one supervision and +one-to-many supervision. Our approach, namely MS-DETR, is simple, and places +one-to-many supervision to the object queries of the primary decoder that is +used for inference. In comparison to existing DETR variants with one-to-many +supervision, such as Group DETR and Hybrid DETR, our approach does not need +additional decoder branches or object queries. The object queries of the +primary decoder in our approach directly benefit from one-to-many supervision +and thus are superior in object candidate prediction. Experimental results show +that our approach outperforms related DETR variants, such as DN-DETR, Hybrid +DETR, and Group DETR, and the combination with related DETR variants further +improves the performance.",cs.CV,['cs.CV'] +Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,Yabin Zhang · Wenjie Zhu · Hui Tang · Zhiyuan Ma · Kaiyang Zhou · Lei Zhang,https://github.com/YBZh/DMN,https://arxiv.org/abs/2403.17589,,2403.17589.pdf,Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,"With the emergence of pre-trained vision-language models like CLIP, how to +adapt them to various downstream classification tasks has garnered significant +attention in recent research. The adaptation strategies can be typically +categorized into three paradigms: zero-shot adaptation, few-shot adaptation, +and the recently-proposed training-free few-shot adaptation. Most existing +approaches are tailored for a specific setting and can only cater to one or two +of these paradigms. In this paper, we introduce a versatile adaptation approach +that can effectively work under all three settings. Specifically, we propose +the dual memory networks that comprise dynamic and static memory components. +The static memory caches training data knowledge, enabling training-free +few-shot adaptation, while the dynamic memory preserves historical test +features online during the testing process, allowing for the exploration of +additional data insights beyond the training set. This novel capability +enhances model performance in the few-shot setting and enables model usability +in the absence of training data. The two memory networks employ the same +flexible memory interactive strategy, which can operate in a training-free mode +and can be further enhanced by incorporating learnable projection layers. 
Our
+approach is tested across 11 datasets under the three task settings.
+Remarkably, in the zero-shot scenario, it outperforms existing methods by over
+3% and even shows superior results against methods utilizing external training
+data. Additionally, our method exhibits robust performance against natural
+distribution shifts. Codes are available at https://github.com/YBZh/DMN.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']"
+Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI,Sean I. Young · Yaël Balbastre · Bruce Fischl · Polina Golland · Juan Iglesias, ,https://arxiv.org/abs/2312.03102,,2312.03102.pdf,Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI,"In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR)
+refers to computational reconstruction of an unknown 3D magnetic resonance
+volume from stacks of 2D slices corrupted by motion. While promising, current
+SVR methods require multiple slice stacks for accurate 3D reconstruction,
+leading to long scans and limiting their use in time-sensitive applications
+such as fetal fMRI. Here, we propose a SVR method that overcomes the
+shortcomings of previous work and produces state-of-the-art reconstructions in
+the presence of extreme inter-slice motion. Inspired by the recent success of
+single-view depth estimation methods, we formulate SVR as a single-stack motion
+estimation task and train a fully convolutional network to predict a motion
+stack for a given slice stack, producing a 3D reconstruction as a byproduct of
+the predicted motion. Extensive experiments on the SVR of adult and fetal
+brains demonstrate that our fully convolutional method is twice as accurate as
+previous SVR methods. Our code is available at github.com/seannz/svr.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']"
+Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/DiffusionZSSBIR/,https://arxiv.org/abs/2403.07214,,2403.07214.pdf,Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers,"This paper, for the first time, explores text-to-image diffusion models for
+Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal
+discovery: the capacity of text-to-image diffusion models to seamlessly bridge
+the gap between sketches and photos. This proficiency is underpinned by their
+robust cross-modal capabilities and shape bias, findings that are substantiated
+through our pilot studies. In order to harness pre-trained diffusion models
+effectively, we introduce a straightforward yet powerful strategy focused on
+two key aspects: selecting optimal feature layers and utilising visual and
+textual prompts. For the former, we identify which layers are most enriched
+with information and are best suited for the specific retrieval requirements
+(category-level or fine-grained). Then we employ visual and textual prompts to
+guide the model's feature extraction process, enabling it to generate more
+discriminative and contextually relevant cross-modal representations. 
Extensive +experiments on several benchmark datasets validate significant performance +improvements.",cs.CV,['cs.CV'] +Enhance Image Classification Via Inter-Class Image Mixup With Diffusion Model,Zhicai Wang · Longhui Wei · Tan Wang · Heyu Chen · Yanbin Hao · Xiang Wang · Xiangnan He · Qi Tian, ,https://arxiv.org/abs/2403.19600,,2403.19600.pdf,Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model,"Text-to-image (T2I) generative models have recently emerged as a powerful +tool, enabling the creation of photo-realistic images and giving rise to a +multitude of applications. However, the effective integration of T2I models +into fundamental image classification tasks remains an open question. A +prevalent strategy to bolster image classification performance is through +augmenting the training set with synthetic images generated by T2I models. In +this study, we scrutinize the shortcomings of both current generative and +conventional data augmentation techniques. Our analysis reveals that these +methods struggle to produce images that are both faithful (in terms of +foreground objects) and diverse (in terms of background contexts) for +domain-specific concepts. To tackle this challenge, we introduce an innovative +inter-class data augmentation method known as Diff-Mix +(https://github.com/Zhicaiwww/Diff-Mix), which enriches the dataset by +performing image translations between classes. Our empirical results +demonstrate that Diff-Mix achieves a better balance between faithfulness and +diversity, leading to a marked improvement in performance across diverse image +classification scenarios, including few-shot, conventional, and long-tail +classifications for domain-specific datasets.",cs.CV,['cs.CV'] +Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata,Dongsu Zhang · Francis Williams · Žan Gojčič · Karsten Kreis · Sanja Fidler · Young Min Kim · Amlan Kar, ,,https://www.tandfonline.com/doi/full/10.1080/15481603.2023.2290352,,,,,nan +3D Neural Edge Reconstruction,Lei Li · Songyou Peng · Zehao Yu · Shaohui Liu · Rémi Pautrat · Xiaochuan Yin · Marc Pollefeys,https://neural-edge-map.github.io/,https://arxiv.org/abs/2405.19295,,2405.19295.pdf,3D Neural Edge Reconstruction,"Real-world objects and environments are predominantly composed of edge +features, including straight lines and curves. Such edges are crucial elements +for various applications, such as CAD modeling, surface meshing, lane mapping, +etc. However, existing traditional methods only prioritize lines over curves +for simplicity in geometric modeling. To this end, we introduce EMAP, a new +method for learning 3D edge representations with a focus on both lines and +curves. Our method implicitly encodes 3D edge distance and direction in +Unsigned Distance Functions (UDF) from multi-view edge maps. On top of this +neural representation, we propose an edge extraction algorithm that robustly +abstracts parametric 3D edges from the inferred edge points and their +directions. Comprehensive evaluations demonstrate that our method achieves +better 3D edge reconstruction on multiple challenging datasets. 
We further show +that our learned UDF field enhances neural surface reconstruction by capturing +more details.",cs.CV,['cs.CV'] +ProMark: Proactive Diffusion Watermarking for Causal Attribution,Vishal Asnani · John Collomosse · Tu Bui · Xiaoming Liu · Shruti Agarwal, ,https://arxiv.org/abs/2403.09914,,2403.09914.pdf,ProMark: Proactive Diffusion Watermarking for Causal Attribution,"Generative AI (GenAI) is transforming creative workflows through the +capability to synthesize and manipulate images via high-level prompts. Yet +creatives are not well supported to receive recognition or reward for the use +of their content in GenAI training. To this end, we propose ProMark, a causal +attribution technique to attribute a synthetically generated image to its +training data concepts like objects, motifs, templates, artists, or styles. The +concept information is proactively embedded into the input training images +using imperceptible watermarks, and the diffusion models (unconditional or +conditional) are trained to retain the corresponding watermarks in generated +images. We show that we can embed as many as $2^{16}$ unique watermarks into +the training data, and each training image can contain more than one watermark. +ProMark can maintain image quality whilst outperforming correlation-based +attribution. Finally, several qualitative examples are presented, providing the +confidence that the presence of the watermark conveys a causative relationship +between training data and synthetic images.",cs.CV,['cs.CV'] +Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior,Chen Cheng · Xiaofeng Yang · Fan Yang · Chengzeng Feng · ZHOUJIE FU · Chuan-Sheng Foo · Guosheng Lin · Fayao Liu, ,https://arxiv.org/abs/2403.09140,,2403.09140.pdf,Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior,"Recent works on text-to-3d generation show that using only 2D diffusion +supervision for 3D generation tends to produce results with inconsistent +appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals +with extra legs). Existing methods mainly address this issue by retraining +diffusion models with images rendered from 3D data to ensure multi-view +consistency while struggling to balance 2D generation quality with 3D +consistency. In this paper, we present a new framework Sculpt3D that equips the +current pipeline with explicit injection of 3D priors from retrieved reference +objects without re-training the 2D diffusion model. Specifically, we +demonstrate that high-quality and diverse 3D geometry can be guaranteed by +keypoints supervision through a sparse ray sampling approach. Moreover, to +ensure accurate appearances of different views, we further modulate the output +of the 2D diffusion model to the correct patterns of the template views without +altering the generated object's style. These two decoupled designs effectively +harness 3D information from reference objects to generate 3D objects while +preserving the generation quality of the 2D diffusion model. Extensive +experiments show our method can largely improve the multi-view consistency +while retaining fidelity and diversity. 
Our project page is available at: +https://stellarcheng.github.io/Sculpt3D/.",cs.CV,['cs.CV'] +Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance,Yu · Jie Huang · Li · Kaiwen Zheng · Qi Zhu · Man Zhou · Feng Zhao, ,,https://github.com/YPatrickW/LMAR,,,,,nan +You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/Sketch2Word/,https://arxiv.org/abs/2403.07222v2,,2403.07222v2.pdf,You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval,"Two primary input modalities prevail in image retrieval: sketch and text. +While text is widely used for inter-category retrieval tasks, sketches have +been established as the sole preferred modality for fine-grained image +retrieval due to their ability to capture intricate visual details. In this +paper, we question the reliance on sketches alone for fine-grained image +retrieval by simultaneously exploring the fine-grained representation +capabilities of both sketch and text, orchestrating a duet between the two. The +end result enables precise retrievals previously unattainable, allowing users +to pose ever-finer queries and incorporate attributes like colour and +contextual cues from text. For this purpose, we introduce a novel +compositionality framework, effectively combining sketches and text using +pre-trained CLIP models, while eliminating the need for extensive fine-grained +textual descriptions. Last but not least, our system extends to novel +applications in composed image retrieval, domain attribute transfer, and +fine-grained generation, providing solutions for various real-world scenarios.",cs.CV,['cs.CV'] +Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos,Kumaranage Ravindu Nagasinghe · Honglu Zhou · Malitha Gunawardhana · Martin Renqiang Min · Daniel Harari · Muhammad Haris Khan,https://ravindu-yasas-nagasinghe.github.io/KEPP-Project_Page/,https://arxiv.org/abs/2403.02782,,2403.02782.pdf,Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos,"In this paper, we explore the capability of an agent to construct a logical +sequence of action steps, thereby assembling a strategic procedural plan. This +plan is crucial for navigating from an initial visual observation to a target +visual outcome, as depicted in real-life instructional videos. Existing works +have attained partial success by extensively leveraging various sources of +information available in the datasets, such as heavy intermediate visual +observations, procedural names, or natural language step-by-step instructions, +for features or supervision signals. However, the task remains formidable due +to the implicit causal constraints in the sequencing of steps and the +variability inherent in multiple feasible plans. To tackle these intricacies +that previous efforts have overlooked, we propose to enhance the capabilities +of the agent by infusing it with procedural knowledge. This knowledge, sourced +from training procedure plans and structured as a directed weighted graph, +equips the agent to better navigate the complexities of step sequencing and its +potential variations. 
We coin our approach KEPP, a novel Knowledge-Enhanced +Procedure Planning system, which harnesses a probabilistic procedural knowledge +graph extracted from training data, effectively acting as a comprehensive +textbook for the training domain. Experimental evaluations across three +widely-used datasets under settings of varying complexity reveal that KEPP +attains superior, state-of-the-art results while requiring only minimal +supervision.",cs.CV,['cs.CV'] +Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes,Ziqian Bai · Feitong Tan · Sean Fanello · Rohit Pandey · Mingsong Dou · Shichen Liu · Ping Tan · Yinda Zhang,https://augmentedperception.github.io/monoavatar-plus/,https://arxiv.org/abs/2404.01543,,2404.01543.pdf,Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes,"3D head avatars built with neural implicit volumetric representations have +achieved unprecedented levels of photorealism. However, the computational cost +of these methods remains a significant barrier to their widespread adoption, +particularly in real-time applications such as virtual reality and +teleconferencing. While attempts have been made to develop fast neural +rendering approaches for static scenes, these methods cannot be simply employed +to support realistic facial expressions, such as in the case of a dynamic +facial performance. To address these challenges, we propose a novel fast 3D +neural implicit head avatar model that achieves real-time rendering while +maintaining fine-grained controllability and high rendering quality. Our key +idea lies in the introduction of local hash table blendshapes, which are +learned and attached to the vertices of an underlying face parametric model. +These per-vertex hash-tables are linearly merged with weights predicted via a +CNN, resulting in expression dependent embeddings. Our novel representation +enables efficient density and color predictions using a lightweight MLP, which +is further accelerated by a hierarchical nearest neighbor search method. +Extensive experiments show that our approach runs in real-time while achieving +comparable rendering quality to state-of-the-arts and decent results on +challenging expressions.",cs.CV,"['cs.CV', 'cs.GR']" +"Towards Co-Evaluation of Cameras, HDR, and Algorithms for Industrial-Grade 6DoF Pose Estimation",Agastya Kalra · Guy Stoppi · Dmitrii Marin · Vage Taamazyan · Aarrushi Shandilya · Rishav Agarwal · Anton Boykov · Aaron Chong · Michael Stark,https://github.com/intrinsic-ai/ipd,https://arxiv.org/abs/2403.03221,,2403.03221.pdf,"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation","Estimating relative camera poses between images has been a central problem in +computer vision. Methods that find correspondences and solve for the +fundamental matrix offer high precision in most cases. Conversely, methods +predicting pose directly using neural networks are more robust to limited +overlap and can infer absolute translation scale, but at the expense of reduced +precision. We show how to combine the best of both methods; our approach yields +results that are both precise and robust, while also accurately inferring +translation scales. At the heart of our model lies a Transformer that (1) +learns to balance between solved and learned pose estimations, and (2) provides +a prior to guide a solver. 
A comprehensive analysis supports our design choices +and demonstrates that our method adapts flexibly to various feature extractors +and correspondence estimators, showing state-of-the-art performance in 6DoF +pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free +Relocalization.",cs.CV,['cs.CV'] +A Generative Approach for Wikipedia-Scale Visual Entity Recognition,Mathilde Caron · Ahmet Iscen · Alireza Fathi · Cordelia Schmid,https://github.com/google-research/scenic/tree/main/scenic/projects/gerald,https://arxiv.org/abs/2403.02041,,2403.02041.pdf,A Generative Approach for Wikipedia-Scale Visual Entity Recognition,"In this paper, we address web-scale visual entity recognition, specifically +the task of mapping a given query image to one of the 6 million existing +entities in Wikipedia. One way of approaching a problem of such scale is using +dual-encoder models (eg CLIP), where all the entity names and query images are +embedded into a unified space, paving the way for an approximate k-NN search. +Alternatively, it is also possible to re-purpose a captioning model to directly +generate the entity names for a given image. In contrast, we introduce a novel +Generative Entity Recognition (GER) framework, which given an input image +learns to auto-regressively decode a semantic and discriminative ``code'' +identifying the target entity. Our experiments demonstrate the efficacy of this +GER paradigm, showcasing state-of-the-art performance on the challenging OVEN +benchmark. GER surpasses strong captioning, dual-encoder, visual matching and +hierarchical classification baselines, affirming its advantage in tackling the +complexities of web-scale recognition.",cs.CV,['cs.CV'] +How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/AbstractAway/,https://arxiv.org/abs/2403.07203,,2403.07203.pdf,How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?,"In this paper, we propose a novel abstraction-aware sketch-based image +retrieval framework capable of handling sketch abstraction at varied levels. +Prior works had mainly focused on tackling sub-factors such as drawing style +and order, we instead attempt to model abstraction as a whole, and propose +feature-level and retrieval granularity-level designs so that the system builds +into its DNA the necessary means to interpret abstraction. On learning +abstraction-aware features, we for the first-time harness the rich semantic +embedding of pre-trained StyleGAN model, together with a novel +abstraction-level mapper that deciphers the level of abstraction and +dynamically selects appropriate dimensions in the feature matrix +correspondingly, to construct a feature matrix embedding that can be freely +traversed to accommodate different levels of abstraction. For granularity-level +abstraction understanding, we dictate that the retrieval model should not treat +all abstraction-levels equally and introduce a differentiable surrogate Acc.@q +loss to inject that understanding into the system. Different to the +gold-standard triplet loss, our Acc.@q loss uniquely allows a sketch to +narrow/broaden its focus in terms of how stringent the evaluation should be - +the more abstract a sketch, the less stringent (higher q). 
Extensive +experiments depict our method to outperform existing state-of-the-arts in +standard SBIR tasks along with challenging scenarios like early retrieval, +forensic sketch-photo matching, and style-invariant retrieval.",cs.CV,['cs.CV'] +A Recipe for Scaling up Text-to-Video Generation with Text-free Videos,Xiang Wang · Shiwei Zhang · Hangjie Yuan · Zhiwu Qing · Biao Gong · Yingya Zhang · Yujun Shen · Changxin Gao · Nong Sang,https://tf-t2v.github.io/,https://arxiv.org/abs/2312.15770,,2312.15770.pdf,A Recipe for Scaling up Text-to-Video Generation with Text-free Videos,"Diffusion-based text-to-video generation has witnessed impressive progress in +the past year yet still falls behind text-to-image generation. One of the key +reasons is the limited scale of publicly available data (e.g., 10M video-text +pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost +of video captioning. Instead, it could be far easier to collect unlabeled clips +from video platforms like YouTube. Motivated by this, we come up with a novel +text-to-video generation framework, termed TF-T2V, which can directly learn +with text-free videos. The rationale behind is to separate the process of text +decoding from that of temporal modeling. To this end, we employ a content +branch and a motion branch, which are jointly optimized with weights shared. +Following such a pipeline, we study the effect of doubling the scale of +training set (i.e., video-only WebVid10M) with some randomly collected +text-free videos and are encouraged to observe the performance improvement (FID +from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of +our approach. We also find that our model could enjoy sustainable performance +gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some +text labels for training. Finally, we validate the effectiveness and +generalizability of our ideology on both native text-to-video generation and +compositional video synthesis paradigms. Code and models will be publicly +available at https://tf-t2v.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +HOIAnimator: Text-Prompt Human-Object Animations Generation with Perceptive Diffusion Models,Wenfeng Song · Xinyu Zhang · Shuai Li · Yang Gao · Aimin Hao · Xia HOU · Chenglizhao Chen · Ning Li · Hong Qin, ,https://arxiv.org/abs/2312.06553,,2312.06553.pdf,HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models,"We address the problem of generating realistic 3D human-object interactions +(HOIs) driven by textual prompts. To this end, we take a modular design and +decompose the complex task into simpler sub-tasks. We first develop a +dual-branch diffusion model (HOI-DM) to generate both human and object motions +conditioned on the input text, and encourage coherent motions by a +cross-attention communication module between the human and object motion +generation branches. We also develop an affordance prediction diffusion model +(APDM) to predict the contacting area between the human and object during the +interactions driven by the textual prompt. The APDM is independent of the +results by the HOI-DM and thus can correct potential errors by the latter. +Moreover, it stochastically generates the contacting points to diversify the +generated motions. Finally, we incorporate the estimated contacting points into +the classifier-guidance to achieve accurate and close contact between humans +and objects. 
To train and evaluate our approach, we annotate BEHAVE dataset +with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate +that our approach produces realistic HOIs with various interactions and +different types of objects.",cs.CV,['cs.CV'] +Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement,Ziyu Wang · Yue Xu · Cewu Lu · Yonglu Li, ,https://arxiv.org/abs/2312.00362,,2312.00362.pdf,Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement,"Recently, dataset distillation has paved the way towards efficient machine +learning, especially for image datasets. However, the distillation for videos, +characterized by an exclusive temporal dimension, remains an underexplored +domain. In this work, we provide the first systematic study of video +distillation and introduce a taxonomy to categorize temporal compression. Our +investigation reveals that the temporal information is usually not well learned +during distillation, and the temporal dimension of synthetic data contributes +little. The observations motivate our unified framework of disentangling the +dynamic and static information in the videos. It first distills the videos into +still images as static memory and then compensates the dynamic and motion +information with a learnable dynamic memory block. Our method achieves +state-of-the-art on video datasets at different scales, with a notably smaller +memory storage budget. Our code is available at +https://github.com/yuz1wan/video_distillation.",cs.CV,"['cs.CV', 'cs.LG']" +Readout Guidance: Learning Control from Diffusion Features,Grace Luo · Trevor Darrell · Oliver Wang · Dan B Goldman · Aleksander Holynski,https://readout-guidance.github.io,https://arxiv.org/abs/2312.02150,,2312.02150.pdf,Readout Guidance: Learning Control from Diffusion Features,"We present Readout Guidance, a method for controlling text-to-image diffusion +models with learned signals. Readout Guidance uses readout heads, lightweight +networks trained to extract signals from the features of a pre-trained, frozen +diffusion model at every timestep. These readouts can encode single-image +properties, such as pose, depth, and edges; or higher-order properties that +relate multiple images, such as correspondence and appearance similarity. +Furthermore, by comparing the readout estimates to a user-defined target, and +back-propagating the gradient through the readout head, these estimates can be +used to guide the sampling process. Compared to prior methods for conditional +generation, Readout Guidance requires significantly fewer added parameters and +training samples, and offers a convenient and simple recipe for reproducing +different forms of conditional control under a single framework, with a single +architecture and sampling procedure. We showcase these benefits in the +applications of drag-based manipulation, identity-consistent generation, and +spatially aligned control. Project page: https://readout-guidance.github.io.",cs.CV,['cs.CV'] +BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection,Zhenxin Li · Shiyi Lan · Jose M. Alvarez · Zuxuan Wu, ,https://arxiv.org/abs/2312.01696v1,,2312.01696v1.pdf,BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection,"Recently, the rise of query-based Transformer decoders is reshaping +camera-based 3D object detection. These query-based decoders are surpassing the +traditional dense BEV (Bird's Eye View)-based methods. 
However, we argue that +dense BEV frameworks remain important due to their outstanding abilities in +depth estimation and object localization, depicting 3D scenes accurately and +comprehensively. This paper aims to address the drawbacks of the existing dense +BEV-based 3D object detectors by introducing our proposed enhanced components, +including a CRF-modulated depth estimation module enforcing object-level +consistencies, a long-term temporal aggregation module with extended receptive +fields, and a two-stage object decoder combining perspective techniques with +CRF-modulated depth embedding. These enhancements lead to a ""modernized"" dense +BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms +both BEV-based and query-based frameworks under various settings, achieving a +state-of-the-art result of 64.2 NDS on the nuScenes test set.",cs.CV,['cs.CV'] +It's All About Your Sketch: Democratising Sketch Control in Diffusion Models,Subhadeep Koley · Ayan Kumar Bhunia · Deeptanshu Sekhri · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/StableSketching/,https://arxiv.org/abs/2403.07234,,2403.07234.pdf,It's All About Your Sketch: Democratising Sketch Control in Diffusion Models,"This paper unravels the potential of sketches for diffusion models, +addressing the deceptive promise of direct sketch control in generative AI. We +importantly democratise the process, enabling amateur sketches to generate +precise images, living up to the commitment of ""what you sketch is what you +get"". A pilot study underscores the necessity, revealing that deformities in +existing models stem from spatial-conditioning. To rectify this, we propose an +abstraction-aware framework, utilising a sketch adapter, adaptive time-step +sampling, and discriminative guidance from a pre-trained fine-grained +sketch-based image retrieval model, working synergistically to reinforce +fine-grained sketch-photo association. Our approach operates seamlessly during +inference without the need for textual prompts; a simple, rough sketch akin to +what you and I can create suffices! We welcome everyone to examine results +presented in the paper and its supplementary. Contributions include +democratising sketch control, introducing an abstraction-aware framework, and +leveraging discriminative guidance, validated through extensive experiments.",cs.CV,['cs.CV'] +COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction,Qihang Ma · Xin Tan · Yanyun Qu · Lizhuang Ma · Zhizhong Zhang · Yuan Xie,https://github.com/NotACracker/COTR,https://arxiv.org/abs/2312.01919,,2312.01919.pdf,COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction,"The autonomous driving community has shown significant interest in 3D +occupancy prediction, driven by its exceptional geometric perception and +general object recognition capabilities. To achieve this, current works try to +construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation +extending from the Bird-Eye-View perception. However, compressed views like TPV +representation lose 3D geometry information while raw and sparse OCC +representation requires heavy but redundant computational costs. To address the +above limitations, we propose Compact Occupancy TRansformer (COTR), with a +geometry-aware occupancy encoder and a semantic-aware group decoder to +reconstruct a compact 3D OCC representation. 
The occupancy encoder first +generates a compact geometrical OCC feature through efficient explicit-implicit +view transformation. Then, the occupancy decoder further enhances the semantic +discriminability of the compact OCC representation by a coarse-to-fine semantic +grouping strategy. Empirical experiments show that there are evident +performance gains across multiple baselines, e.g., COTR outperforms baselines +with a relative improvement of 8%-15%, demonstrating the superiority of our +method.",cs.CV,['cs.CV'] +Global and Local Prompts Cooperation via Optimal Transport for Federated Learning,Hongxia Li · Wei Huang · Jingya Wang · Ye Shi,https://github.com/HongxiaLee/FedOTP,https://arxiv.org/abs/2403.00041,,2403.00041.pdf,Global and Local Prompts Cooperation via Optimal Transport for Federated Learning,"Prompt learning in pretrained visual-language models has shown remarkable +flexibility across various downstream tasks. Leveraging its inherent +lightweight nature, recent research attempted to integrate the powerful +pretrained models into federated learning frameworks to simultaneously reduce +communication costs and promote local training on insufficient data. Despite +these efforts, current federated prompt learning methods lack specialized +designs to systematically address severe data heterogeneities, e.g., data +distribution with both label and feature shifts involved. To address this +challenge, we present Federated Prompts Cooperation via Optimal Transport +(FedOTP), which introduces efficient collaborative prompt learning strategies +to capture diverse category traits on a per-client basis. Specifically, for +each client, we learn a global prompt to extract consensus knowledge among +clients, and a local prompt to capture client-specific category +characteristics. Unbalanced Optimal Transport is then employed to align local +visual features with these prompts, striking a balance between global consensus +and local personalization. By relaxing one of the equality constraints, FedOTP +enables prompts to focus solely on the core regions of image patches. Extensive +experiments on datasets with various types of heterogeneities have demonstrated +that our FedOTP outperforms the state-of-the-art methods.",cs.LG,"['cs.LG', 'cs.AI', 'cs.DC']" +Rethinking the Evaluation Protocol of Domain Generalization,Han Yu · Xingxuan Zhang · Renzhe Xu · Jiashuo Liu · Yue He · Peng Cui, ,https://arxiv.org/abs/2307.11108,,2307.11108.pdf,Flatness-Aware Minimization for Domain Generalization,"Domain generalization (DG) seeks to learn robust models that generalize well +under unknown distribution shifts. As a critical aspect of DG, optimizer +selection has not been explored in depth. Currently, most DG methods follow the +widely used benchmark, DomainBed, and utilize Adam as the default optimizer for +all datasets. However, we reveal that Adam is not necessarily the optimal +choice for the majority of current DG methods and datasets. Based on the +perspective of loss landscape flatness, we propose a novel approach, +Flatness-Aware Minimization for Domain Generalization (FAD), which can +efficiently optimize both zeroth-order and first-order flatness simultaneously +for DG. We provide theoretical analyses of the FAD's out-of-distribution (OOD) +generalization error and convergence. Our experimental results demonstrate the +superiority of FAD on various DG datasets. 
Additionally, we confirm that FAD is +capable of discovering flatter optima in comparison to other zeroth-order and +first-order flatness-aware optimization methods.",cs.CV,"['cs.CV', 'cs.LG']" +"The More You See in 2D, the More You Perceive in 3D",Xinyang Han · Zelin Gao · Angjoo Kanazawa · Shubham Goel · Yossi Gandelsman, ,https://arxiv.org/abs/2404.03652,,2404.03652.pdf,"The More You See in 2D, the More You Perceive in 3D","Humans can infer 3D structure from 2D images of an object based on past +experience and improve their 3D understanding as they see more images. Inspired +by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel +view synthesis from an arbitrary number of unposed images. Given a few unposed +images of an object, we adapt a pre-trained view-conditioned diffusion model +together with the camera poses of the images via test-time fine-tuning. The +adapted diffusion model and the obtained camera poses are then utilized as +instance-specific priors for 3D reconstruction and novel view synthesis. We +show that as the number of input images increases, the performance of our +approach improves, bridging the gap between optimization-based prior-less 3D +reconstruction methods and single-image-to-3D diffusion-based methods. We +demonstrate our system on real images as well as standard synthetic benchmarks. +Our ablation studies confirm that this adaption behavior is key for more +accurate 3D understanding.",cs.CV,['cs.CV'] +"Selective, Interpretable and Motion Consistent Privacy Attribute Obfuscation for Action Recognition",Filip Ilic · He Zhao · Thomas Pock · Richard P. Wildes,https://f-ilic.github.io/SelectivePrivacyPreservation,https://arxiv.org/abs/2403.12710,,2403.12710.pdf,"Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition","Concerns for the privacy of individuals captured in public imagery have led +to privacy-preserving action recognition. Existing approaches often suffer from +issues arising through obfuscation being applied globally and a lack of +interpretability. Global obfuscation hides privacy sensitive regions, but also +contextual regions important for action recognition. Lack of interpretability +erodes trust in these new technologies. We highlight the limitations of current +paradigms and propose a solution: Human selected privacy templates that yield +interpretability by design, an obfuscation scheme that selectively hides +attributes and also induces temporal consistency, which is important in action +recognition. Our approach is architecture agnostic and directly modifies input +imagery, while existing approaches generally require architecture training. Our +approach offers more flexibility, as no retraining is required, and outperforms +alternatives on three widely used datasets.",cs.CV,"['cs.CV', 'cs.LG']" +OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition,Yuchen Pan · Junjun Jiang · Kui Jiang · Zhihao Wu · Keyuan Yu · Xianming Liu, ,https://arxiv.org/abs/2402.18786,,2402.18786.pdf,OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition,"Depression Recognition (DR) poses a considerable challenge, especially in the +context of the growing concerns surrounding privacy. Traditional automatic +diagnosis of DR technology necessitates the use of facial images, undoubtedly +expose the patient identity features and poses privacy risks. 
In order to +mitigate the potential risks associated with the inappropriate disclosure of +patient facial images, we design a new imaging system to erase the identity +information of captured facial images while retain disease-relevant features. +It is irreversible for identity information recovery while preserving essential +disease-related characteristics necessary for accurate DR. More specifically, +we try to record a de-identified facial image (erasing the identifiable +features as much as possible) by a learnable lens, which is optimized in +conjunction with the following DR task as well as a range of face analysis +related auxiliary tasks in an end-to-end manner. These aforementioned +strategies form our final Optical deep Depression Recognition network +(OpticalDR). Experiments on CelebA, AVEC 2013, and AVEC 2014 datasets +demonstrate that our OpticalDR has achieved state-of-the-art privacy protection +performance with an average AUC of 0.51 on popular facial recognition models, +and competitive results for DR with MAE/RMSE of 7.53/8.48 on AVEC 2013 and +7.89/8.82 on AVEC 2014, respectively.",cs.CV,['cs.CV'] +NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis,Nilesh Kulkarni · Davis Rempe · Kyle Genova · Abhijit Kundu · Justin Johnson · David Fouhey · Leonidas Guibas, ,https://arxiv.org/abs/2307.07511,,2307.07511.pdf,NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis,"We address the problem of generating realistic 3D motions of humans +interacting with objects in a scene. Our key idea is to create a neural +interaction field attached to a specific object, which outputs the distance to +the valid interaction manifold given a human pose as input. This interaction +field guides the sampling of an object-conditioned human motion diffusion +model, so as to encourage plausible contacts and affordance semantics. To +support interactions with scarcely available data, we propose an automated +synthetic data pipeline. For this, we seed a pre-trained motion model, which +has priors for the basics of human movement, with interaction-specific anchor +poses extracted from limited motion capture data. Using our guided diffusion +model trained on generated synthetic data, we synthesize realistic motions for +sitting and lifting with several objects, outperforming alternative approaches +in terms of motion quality and successful action completion. We call our +framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.",cs.CV,['cs.CV'] +On The Vulnerability of Efficient Vision Transformers to Adversarial Computation Attacks,Navaneet K L · Soroush Abbasi Koohpayegani · Essam Sleiman · Hamed Pirsiavash, ,https://arxiv.org/html/2208.09602v2,,2208.09602v2.pdf,Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective,"The Vision Transformer has emerged as a powerful tool for image +classification tasks, surpassing the performance of convolutional neural +networks (CNNs). Recently, many researchers have attempted to understand the +robustness of Transformers against adversarial attacks. However, previous +researches have focused solely on perturbations in the spatial domain. This +paper proposes an additional perspective that explores the adversarial +robustness of Transformers against frequency-selective perturbations in the +spectral domain. To facilitate comparison between these two domains, an attack +framework is formulated as a flexible tool for implementing attacks on images +in the spatial and spectral domains. 
The experiments reveal that Transformers +rely more on phase and low frequency information, which can render them more +vulnerable to frequency-selective attacks than CNNs. This work offers new +insights into the properties and adversarial robustness of Transformers.",cs.CV,['cs.CV'] +Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation,Daichi Horita · Naoto Inoue · Kotaro Kikuchi · Kota Yamaguchi · Kiyoharu Aizawa,https://udonda.github.io/RALF/,https://arxiv.org/abs/2311.13602,,2311.13602.pdf,Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation,"Content-aware graphic layout generation aims to automatically arrange visual +elements along with a given content, such as an e-commerce product image. In +this paper, we argue that the current layout generation approaches suffer from +the limited training data for the high-dimensional layout structure. We show +that a simple retrieval augmentation can significantly improve the generation +quality. Our model, which is named Retrieval-Augmented Layout Transformer +(RALF), retrieves nearest neighbor layout examples based on an input image and +feeds these results into an autoregressive generator. Our model can apply +retrieval augmentation to various controllable generation tasks and yield +high-quality layouts within a unified architecture. Our extensive experiments +show that RALF successfully generates content-aware layouts in both constrained +and unconstrained settings and significantly outperforms the baselines.",cs.CV,['cs.CV'] +Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks,Shin'ya Yamaguchi · Sekitoshi Kanai · Kazuki Adachi · Daiki Chijiwa,https://github.com/yshinya6/adarand,https://arxiv.org/abs/2403.10097,,2403.10097.pdf,Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks,"While fine-tuning is a de facto standard method for training deep neural +networks, it still suffers from overfitting when using small target datasets. +Previous methods improve fine-tuning performance by maintaining knowledge of +the source datasets or introducing regularization terms such as contrastive +loss. However, these methods require auxiliary source information (e.g., source +labels or datasets) or heavy additional computations. In this paper, we propose +a simple method called adaptive random feature regularization (AdaRand). +AdaRand helps the feature extractors of training models to adaptively change +the distribution of feature vectors for downstream classification tasks without +auxiliary source information and with reasonable computation costs. To this +end, AdaRand minimizes the gap between feature vectors and random reference +vectors that are sampled from class conditional Gaussian distributions. +Furthermore, AdaRand dynamically updates the conditional distribution to follow +the currently updated feature extractors and balance the distance between +classes in feature spaces. 
Our experiments show that AdaRand outperforms the +other fine-tuning regularization, which requires auxiliary source information +and heavy computation costs.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing,Xiaoyang Wang · Hongping Gan, ,,https://link.springer.com/article/10.1007/s11263-023-01814-w,,,,,nan +Error Detection in Egocentric Procedural Task Videos,Shih-Po Lee · Zijia Lu · Zekun Zhang · Minh Hoai · Ehsan Elhamifar, ,https://arxiv.org/abs/2404.01933,,2404.01933.pdf,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +Low-Rank Knowledge Decomposition for Medical Foundation Models,Yuhang Zhou · Haolin li · Siyuan Du · Jiangchao Yao · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2404.17184,,2404.17184.pdf,Low-Rank Knowledge Decomposition for Medical Foundation Models,"The popularity of large-scale pre-training has promoted the development of +medical foundation models. However, some studies have shown that although +foundation models exhibit strong general feature extraction capabilities, their +performance on specific tasks is still inferior to task-specific methods. In +this paper, we explore a new perspective called ``Knowledge Decomposition'' to +improve the performance on specific medical tasks, which deconstruct the +foundation model into multiple lightweight expert models, each dedicated to a +particular task, with the goal of improving specialization while concurrently +mitigating resource expenditure. To accomplish the above objective, we design a +novel framework named Low-Rank Knowledge Decomposition (LoRKD), which +explicitly separates graidents by incorporating low-rank expert modules and the +efficient knowledge separation convolution. 
Extensive experimental results +demonstrate that the decomposed models perform well in terms of performance and +transferability, even surpassing the original foundation models.",cs.CV,['cs.CV'] +GS-IR: 3D Gaussian Splatting for Inverse Rendering,Zhihao Liang · Qi Zhang · Ying Feng · Ying Shan · Kui Jia, ,https://arxiv.org/abs/2311.16473,,2311.16473.pdf,GS-IR: 3D Gaussian Splatting for Inverse Rendering,"We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian +Splatting (GS) that leverages forward mapping volume rendering to achieve +photorealistic novel view synthesis and relighting results. Unlike previous +works that use implicit neural representations and volume rendering (e.g. +NeRF), which suffer from low expressive power and high computational +complexity, we extend GS, a top-performance representation for novel view +synthesis, to estimate scene geometry, surface material, and environment +illumination from multi-view images captured under unknown lighting conditions. +There are two main problems when introducing GS to inverse rendering: 1) GS +does not support producing plausible normal natively; 2) forward mapping (e.g. +rasterization and splatting) cannot trace the occlusion like backward mapping +(e.g. ray tracing). To address these challenges, our GS-IR proposes an +efficient optimization scheme that incorporates a depth-derivation-based +regularization for normal estimation and a baking-based occlusion to model +indirect lighting. The flexible and expressive GS representation allows us to +achieve fast and compact geometry reconstruction, photorealistic novel view +synthesis, and effective physically-based rendering. We demonstrate the +superiority of our method over baseline methods through qualitative and +quantitative evaluations on various challenging scenes.",cs.CV,['cs.CV'] +Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation,Yuanhong Chen · Yuyuan Liu · Hu Wang · Fengbei Liu · Chong Wang · Helen Frazer · Gustavo Carneiro, ,https://arxiv.org/abs/2310.18709,,2310.18709.pdf,Audio-Visual Instance Segmentation,"In this paper, we propose a new multi-modal task, namely audio-visual +instance segmentation (AVIS), in which the goal is to identify, segment, and +track individual sounding object instances in audible videos, simultaneously. +To our knowledge, it is the first time that instance segmentation has been +extended into the audio-visual domain. To better facilitate this research, we +construct the first audio-visual instance segmentation benchmark (AVISeg). +Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 +seconds from YouTube and public audio-visual datasets, where 117 videos have +been annotated by using an interactive semi-automatic labeling tool based on +the Segment Anything Model (SAM). In addition, we present a simple baseline +model for the AVIS task. Our new model introduces an audio branch and a +cross-modal fusion module to Mask2Former to locate all sounding objects. +Finally, we evaluate the proposed method using two backbones on AVISeg. 
We +believe that AVIS will inspire the community towards a more comprehensive +multi-modal understanding.",cs.CV,"['cs.CV', 'cs.LG', 'cs.MM', 'cs.SD', 'eess.AS']" +Towards Generalizable Multi-Object Tracking,Zheng Qin · Le Wang · Sanping Zhou · Panpan Fu · Gang Hua · Wei Tang, ,http://export.arxiv.org/abs/2311.10382,,2311.10382.pdf,Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking,"Multi-Object Tracking (MOT) remains a vital component of intelligent video +analysis, which aims to locate targets and maintain a consistent identity for +each target throughout a video sequence. Existing works usually learn a +discriminative feature representation, such as motion and appearance, to +associate the detections across frames, which are easily affected by mutual +occlusion and background clutter in practice. In this paper, we propose a +simple yet effective two-stage feature learning paradigm to jointly learn +single-shot and multi-shot features for different targets, so as to achieve +robust data association in the tracking process. For the detections without +being associated, we design a novel single-shot feature learning module to +extract discriminative features of each detection, which can efficiently +associate targets between adjacent frames. For the tracklets being lost several +frames, we design a novel multi-shot feature learning module to extract +discriminative features of each tracklet, which can accurately refind these +lost targets after a long period. Once equipped with a simple data association +logic, the resulting VisualTracker can perform robust MOT based on the +single-shot and multi-shot feature representations. Extensive experimental +results demonstrate that our method has achieved significant improvements on +MOT17 and MOT20 datasets while reaching state-of-the-art performance on +DanceTrack dataset.",cs.CV,['cs.CV'] +Authentic Hand Avatar from a Phone Scan via Universal Hand Model,Gyeongsik Moon · Weipeng Xu · Rohan Joshi · Chenglei Wu · Takaaki Shiratori, ,https://arxiv.org/abs/2405.07933,,2405.07933.pdf,Authentic Hand Avatar from a Phone Scan via Universal Hand Model,"The authentic 3D hand avatar with every identifiable information, such as +hand shapes and textures, is necessary for immersive experiences in AR/VR. In +this paper, we present a universal hand model (UHM), which 1) can universally +represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can +be adapted to each person with a short phone scan for the authentic hand +avatar. For effective universal hand modeling, we perform tracking and modeling +at the same time, while previous 3D hand models perform them separately. The +conventional separate pipeline suffers from the accumulated errors from the +tracking stage, which cannot be recovered in the modeling stage. On the other +hand, ours does not suffer from the accumulated errors while having a much more +concise overall pipeline. We additionally introduce a novel image matching loss +function to address a skin sliding during the tracking and modeling, while +existing works have not focused on it much. Finally, using learned priors from +our UHM, we effectively adapt our UHM to each person's short phone scan for the +authentic hand avatar.",cs.CV,['cs.CV'] +WANDR: Intention-guided Human Motion Generation,Markos Diomataris · Nikos Athanasiou · Omid Taheri · Xi Wang · Otmar Hilliges · Michael J. 
Black,https://wandr.is.tue.mpg.de/,https://arxiv.org/abs/2404.15383,,2404.15383.pdf,WANDR: Intention-guided Human Motion Generation,"Synthesizing natural human motions that enable a 3D human avatar to walk and +reach for arbitrary goals in 3D space remains an unsolved problem with many +applications. Existing methods (data-driven or using reinforcement learning) +are limited in terms of generalization and motion naturalness. A primary +obstacle is the scarcity of training data that combines locomotion with goal +reaching. To address this, we introduce WANDR, a data-driven model that takes +an avatar's initial pose and a goal's 3D position and generates natural human +motions that place the end effector (wrist) on the goal location. To solve +this, we introduce novel intention features that drive rich goal-oriented +movement. Intention guides the agent to the goal, and interactively adapts the +generation to novel situations without needing to define sub-goals or the +entire motion path. Crucially, intention allows training on datasets that have +goal-oriented motions as well as those that do not. WANDR is a conditional +Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE +datasets. We evaluate our method extensively and demonstrate its ability to +generate natural and long-term motions that reach 3D goals and generalize to +unseen goal locations. Our models and code are available for research purposes +at wandr.is.tue.mpg.de.",cs.CV,"['cs.CV', 'cs.AI']" +SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement,Tao Wang · Lei Jin · Zheng Wang · Jianshu Li · Liang Li · Fang Zhao · Yu Cheng · Li Yuan · Li ZHOU · Junliang Xing · Jian Zhao, ,https://arxiv.org/abs/2311.09543,,2311.09543.pdf,Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery,"Though significant progress in human pose and shape recovery from monocular +RGB images has been made in recent years, obtaining 3D human motion with high +accuracy and temporal consistency from videos remains challenging. Existing +video-based methods tend to reconstruct human motion from global image +features, which lack detailed representation capability and limit the +reconstruction accuracy. In this paper, we propose a Temporal-Aware Refining +Network (TAR), to synchronously explore temporal-aware global and local image +features for accurate pose and shape recovery. First, a global transformer +encoder is introduced to obtain temporal global features from static feature +sequences. Second, a bidirectional ConvGRU network takes the sequence of +high-resolution feature maps as input, and outputs temporal local feature maps +that maintain high resolution and capture the local motion of the human body. +Finally, a recurrent refinement module iteratively updates estimated SMPL +parameters by leveraging both global and local temporal information to achieve +accurate and smooth results. Extensive experiments demonstrate that our TAR +obtains more accurate results than previous state-of-the-art methods on popular +benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.",cs.CV,['cs.CV'] +vid-TLDR: Training Free Token merging for Light-weight Video Transformer,Joonmyung Choi · Sanghyeok Lee · Jaewon Chu · Minhyuk Choi · Hyunwoo J. 
Kim,https://github.com/mlvlab/vid-TLDR,https://arxiv.org/abs/2403.13347,,2403.13347.pdf,vid-TLDR: Training Free Token merging for Light-weight Video Transformer,"Video Transformers have become the prevalent solution for various video +downstream tasks with superior expressive power and flexibility. However, these +video transformers suffer from heavy computational costs induced by the massive +number of tokens across the entire video frames, which has been the major +barrier to training the model. Further, the patches irrelevant to the main +contents, e.g., backgrounds, degrade the generalization performance of models. +To tackle these issues, we propose training free token merging for lightweight +video Transformer (vid-TLDR) that aims to enhance the efficiency of video +Transformers by merging the background tokens without additional training. For +vid-TLDR, we introduce a novel approach to capture the salient regions in +videos only with the attention map. Further, we introduce the saliency-aware +token merging strategy by dropping the background tokens and sharpening the +object scores. Our experiments show that vid-TLDR significantly mitigates the +computational complexity of video Transformers while achieving competitive +performance compared to the base model without vid-TLDR. Code is available at +https://github.com/mlvlab/vid-TLDR.",cs.CV,['cs.CV'] +Boosting Image Restoration via Priors from Pre-trained Models,Xiaogang Xu · Shu Kong · Tao Hu · Zhe Liu · Hujun Bao, ,https://arxiv.org/abs/2403.06793,,2403.06793.pdf,Boosting Image Restoration via Priors from Pre-trained Models,"Pre-trained models with large-scale training data, such as CLIP and Stable +Diffusion, have demonstrated remarkable performance in various high-level +computer vision tasks such as image understanding and generation from language +descriptions. Yet, their potential for low-level tasks such as image +restoration remains relatively unexplored. In this paper, we explore such +models to enhance image restoration. As off-the-shelf features (OSF) from +pre-trained models do not directly serve image restoration, we propose to learn +an additional lightweight module called Pre-Train-Guided Refinement Module +(PTG-RM) to refine restoration results of a target restoration network with +OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying +Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention +(PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, +while PTG-CSA enhances spatial-channel attention for restoration-related +learning. Extensive experiments demonstrate that PTG-RM, with its compact size +($<$1M parameters), effectively enhances restoration performance of various +models across different tasks, including low-light enhancement, deraining, +deblurring, and denoising.",cs.CV,['cs.CV'] +HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,Yuheng Jiang · Zhehao Shen · Penghao Wang · Zhuo Su · Yu Hong · Yingliang Zhang · Jingyi Yu · Lan Xu,https://nowheretrix.github.io/HiFi4G/,https://arxiv.org/abs/2312.03461,,2312.03461.pdf,HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,"We have recently seen tremendous progress in photo-real human modeling and +rendering. Yet, efficiently rendering realistic human performance and +integrating it into the rasterization pipeline remains challenging. 
In this +paper, we present HiFi4G, an explicit and compact Gaussian-based approach for +high-fidelity human performance rendering from dense footage. Our core +intuition is to marry the 3D Gaussian representation with non-rigid tracking, +achieving a compact and compression-friendly representation. We first propose a +dual-graph mechanism to obtain motion priors, with a coarse deformation graph +for effective initialization and a fine-grained Gaussian graph to enforce +subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with +adaptive spatial-temporal regularizers to effectively balance the non-rigid +prior and Gaussian updating. We also present a companion compression scheme +with residual compensation for immersive experiences on various platforms. It +achieves a substantial compression rate of approximately 25 times, with less +than 2MB of storage per frame. Extensive experiments demonstrate the +effectiveness of our approach, which significantly outperforms existing +approaches in terms of optimization speed, rendering quality, and storage +overhead.",cs.CV,['cs.CV'] +Preserving Fairness Generalization in Deepfake Detection,Li Lin · Li Lin · Xinan He · Yan Ju · Xin Wang · Feng Ding · Shu Hu, ,https://arxiv.org/abs/2402.17229v1,,2402.17229v1.pdf,Preserving Fairness Generalization in Deepfake Detection,"Although effective deepfake detection models have been developed in recent +years, recent studies have revealed that these models can result in unfair +performance disparities among demographic groups, such as race and gender. This +can lead to particular groups facing unfair targeting or exclusion from +detection, potentially allowing misclassified deepfakes to manipulate public +opinion and undermine trust in the model. The existing method for addressing +this problem is providing a fair loss function. It shows good fairness +performance for intra-domain evaluation but does not maintain fairness for +cross-domain testing. This highlights the significance of fairness +generalization in the fight against deepfakes. In this work, we propose the +first method to address the fairness generalization problem in deepfake +detection by simultaneously considering features, loss, and optimization +aspects. Our method employs disentanglement learning to extract demographic and +domain-agnostic forgery features, fusing them to encourage fair learning across +a flattened loss landscape. Extensive experiments on prominent deepfake +datasets demonstrate our method's effectiveness, surpassing state-of-the-art +approaches in preserving fairness during cross-domain deepfake detection. The +code is available at https://github.com/Purdue-M2/Fairness-Generalization",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +CoSeR: Bridging Image and Language for Cognitive Super-Resolution,Haoze Sun · Wenbo Li · Jianzhuang Liu · Haoyu Chen · Renjing Pei · Xueyi Zou · Youliang Yan · Yujiu Yang, ,https://arxiv.org/abs/2311.16512,,2311.16512.pdf,CoSeR: Bridging Image and Language for Cognitive Super-Resolution,"Existing super-resolution (SR) models primarily focus on restoring local +texture details, often neglecting the global semantic information within the +scene. This oversight can lead to the omission of crucial semantic details or +the introduction of inaccurate textures during the recovery process. In our +work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering +SR models with the capacity to comprehend low-resolution images. 
We achieve +this by marrying image appearance and language understanding to generate a +cognitive embedding, which not only activates prior information from large +text-to-image diffusion models but also facilitates the generation of +high-quality reference images to optimize the SR process. To further improve +image fidelity, we propose a novel condition injection scheme called +""All-in-Attention"", consolidating all conditional information into a single +module. Consequently, our method successfully restores semantically correct and +photorealistic details, demonstrating state-of-the-art performance across +multiple benchmarks. Code: https://github.com/VINHYU/CoSeR",cs.CV,"['cs.CV', 'cs.AI']" +Task-Customized Mixture of Adapters for General Image Fusion,Pengfei Zhu · Yang Sun · Bing Cao · Qinghua Hu, ,https://arxiv.org/abs/2403.12494,,2403.12494.pdf,Task-Customized Mixture of Adapters for General Image Fusion,"General image fusion aims at integrating important information from +multi-source images. However, due to the significant cross-task gap, the +respective fusion mechanism varies considerably in practice, resulting in +limited performance across subtasks. To handle this problem, we propose a novel +task-customized mixture of adapters (TC-MoA) for general image fusion, +adaptively prompting various fusion tasks in a unified model. We borrow the +insight from the mixture of experts (MoE), taking the experts as efficient +tuning adapters to prompt a pre-trained foundation model. These adapters are +shared across different tasks and constrained by mutual information +regularization, ensuring compatibility with different tasks while +complementarity for multi-source images. The task-specific routing networks +customize these adapters to extract task-specific information from different +sources with dynamic dominant intensity, performing adaptive visual feature +prompt fusion. Notably, our TC-MoA controls the dominant intensity bias for +different fusion tasks, successfully unifying multiple fusion tasks in a single +model. Extensive experiments show that TC-MoA outperforms the competing +approaches in learning commonalities while retaining compatibility for general +image fusion (multi-modal, multi-exposure, and multi-focus), and also +demonstrating striking controllability on more generalization experiments. The +code is available at https://github.com/YangSun22/TC-MoA .",cs.CV,['cs.CV'] +Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation,Shuting He · Henghui Ding, ,https://arxiv.org/abs/2404.03645,,2404.03645.pdf,Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation,"Referring video segmentation relies on natural language expressions to +identify and segment objects, often emphasizing motion clues. Previous works +treat a sentence as a whole and directly perform identification at the +video-level, mixing up static image-level cues with temporal motion cues. +However, image-level features cannot well comprehend motion cues in sentences, +and static cues are not crucial for temporal perception. In fact, static cues +can sometimes interfere with temporal perception by overshadowing motion cues. +In this work, we propose to decouple video-level referring expression +understanding into static and motion perception, with a specific emphasis on +enhancing temporal comprehension. 
Firstly, we introduce an +expression-decoupling module to make static cues and motion cues perform their +distinct role, alleviating the issue of sentence embeddings overlooking motion +cues. Secondly, we propose a hierarchical motion perception module to capture +temporal information effectively across varying timescales. Furthermore, we +employ contrastive learning to distinguish the motions of visually similar +objects. These contributions yield state-of-the-art performance across five +datasets, including a remarkable $\textbf{9.2%}$ $\mathcal{J\&F}$ improvement +on the challenging $\textbf{MeViS}$ dataset. Code is available at +https://github.com/heshuting555/DsHmp.",cs.CV,['cs.CV'] +MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark,Sanghyun Woo · Kwanyong Park · Inkyu Shin · Myungchul Kim · In So Kweon,https://sites.google.com/view/mtmmc,https://arxiv.org/abs/2403.20225,,2403.20225.pdf,MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark,"Multi-target multi-camera tracking is a crucial task that involves +identifying and tracking individuals over time using video streams from +multiple cameras. This task has practical applications in various fields, such +as visual surveillance, crowd behavior analysis, and anomaly detection. +However, due to the difficulty and cost of collecting and labeling data, +existing datasets for this task are either synthetically generated or +artificially constructed within a controlled camera network setting, which +limits their ability to model real-world dynamics and generalize to diverse +camera configurations. To address this issue, we present MTMMC, a real-world, +large-scale dataset that includes long video sequences captured by 16 +multi-modal cameras in two different environments - campus and factory - across +various time, weather, and season conditions. This dataset provides a +challenging test-bed for studying multi-camera tracking under diverse +real-world complexities and includes an additional input modality of spatially +aligned and temporally synchronized RGB and thermal cameras, which enhances the +accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, +benefiting independent fields such as person detection, re-identification, and +multiple object tracking. We provide baselines and new learning setups on this +dataset and set the reference scores for future studies. The datasets, models, +and test server will be made publicly available.",cs.CV,['cs.CV'] +Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection,Zhanwei Zhang · Minghao Chen · Shuai Xiao · Liang Peng · Hengjia Li · Binbin Lin · Ping Li · Wenxiao Wang · Boxi Wu · Deng Cai, ,https://arxiv.org/abs/2404.19384,,2404.19384.pdf,Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection,"Recent self-training techniques have shown notable improvements in +unsupervised domain adaptation for 3D object detection (3D UDA). These +techniques typically select pseudo labels, i.e., 3D boxes, to supervise models +for the target domain. However, this selection process inevitably introduces +unreliable 3D boxes, in which 3D points cannot be definitively assigned as +foreground or background. Previous techniques mitigate this by reweighting +these boxes as pseudo labels, but these boxes can still poison the training +process. To resolve this problem, in this paper, we propose a novel pseudo +label refinery framework. 
Specifically, in the selection process, to improve +the reliability of pseudo boxes, we propose a complementary augmentation +strategy. This strategy involves either removing all points within an +unreliable box or replacing it with a high-confidence box. Moreover, the point +numbers of instances in high-beam datasets are considerably higher than those +in low-beam datasets, also degrading the quality of pseudo labels during the +training process. We alleviate this issue by generating additional proposals +and aligning RoI features across different domains. Experimental results +demonstrate that our method effectively enhances the quality of pseudo labels +and consistently surpasses the state-of-the-art methods on six autonomous +driving benchmarks. Code will be available at +https://github.com/Zhanwei-Z/PERE.",cs.CV,"['cs.CV', 'cs.AI']" +Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection,Yajing Liu · Shijun Zhou · Xiyao Liu · chunhui Hao · Baojie Fan · Jiandong Tian, ,https://arxiv.org/abs/2405.15225,,2405.15225.pdf,Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection,"Single-source domain generalization (SDG) for object detection is a +challenging yet essential task as the distribution bias of the unseen domain +degrades the algorithm performance significantly. However, existing methods +attempt to extract domain-invariant features, neglecting that the biased data +leads the network to learn biased features that are non-causal and poorly +generalizable. To this end, we propose an Unbiased Faster R-CNN (UFR) for +generalizable feature learning. Specifically, we formulate SDG in object +detection from a causal perspective and construct a Structural Causal Model +(SCM) to analyze the data bias and feature bias in the task, which are caused +by scene confounders and object attribute confounders. Based on the SCM, we +design a Global-Local Transformation module for data augmentation, which +effectively simulates domain diversity and mitigates the data bias. +Additionally, we introduce a Causal Attention Learning module that incorporates +a designed attention invariance loss to learn image-level features that are +robust to scene confounders. Moreover, we develop a Causal Prototype Learning +module with an explicit instance constraint and an implicit prototype +constraint, which further alleviates the negative impact of object attribute +confounders. Experimental results on five scenes demonstrate the prominent +generalization ability of our method, with an improvement of 3.9% mAP on the +Night-Clear scene.",cs.CV,['cs.CV'] +DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,Sikai Bai · Jie ZHANG · Song Guo · Shuaicheng Li · Jingcai Guo · Jun Hou · Tao Han · Xiaocheng Lu, ,https://arxiv.org/abs/2403.08506,,2403.08506.pdf,DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,"Federated learning (FL) has emerged as a powerful paradigm for learning from +decentralized data, and federated domain generalization further considers the +test dataset (target domain) is absent from the decentralized training data +(source domains). However, most existing FL methods assume that domain labels +are provided during training, and their evaluation imposes explicit constraints +on the number of domains, which must strictly match the number of clients. 
+Because of the underutilization of numerous edge devices and additional +cross-client domain annotations in the real world, such restrictions may be +impractical and involve potential privacy leaks. In this paper, we propose an +efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a +method that tackles the above restrictions by learning adaptive prompts for +domain generalization in a distributed manner. Specifically, we first design +two types of prompts, i.e., global prompt to capture general knowledge across +all clients and domain prompts to capture domain-specific knowledge. They +eliminate the restriction on the one-to-one mapping between source domains and +local clients. Furthermore, a dynamic query metric is introduced to +automatically search the suitable domain label for each sample, which includes +two-substep text-image alignments based on prompt tuning without +labor-intensive annotation. Extensive experiments on multiple datasets +demonstrate that our DiPrompT achieves superior domain generalization +performance over state-of-the-art FL methods when domain labels are not +provided, and even outperforms many centralized learning methods using domain +labels.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +HIVE: Harnessing Human Feedback for Instructional Visual Editing,Shu Zhang · Xinyi Yang · Yihao Feng · Can Qin · Chia-Chih Chen · Ning Yu · Zeyuan Chen · Huan Wang · Silvio Savarese · Stefano Ermon · Caiming Xiong · Ran Xu, ,,https://www.semanticscholar.org/paper/HQ-Edit:-A-High-Quality-Dataset-for-Image-Editing-Hui-Yang/09609bd28855fd9b27f043b4dbf509615229bd08,,,,,nan +LightIt: Illumination Modeling and Control for Diffusion Models,Peter Kocsis · Kalyan Sunkavalli · Julien Philip · Matthias Nießner · Yannick Hold-Geoffroy,https://peter-kocsis.github.io/LightIt/,https://arxiv.org/abs/2403.10615,,2403.10615.pdf,LightIt: Illumination Modeling and Control for Diffusion Models,"We introduce LightIt, a method for explicit illumination control for image +generation. Recent generative methods lack lighting control, which is crucial +to numerous artistic aspects of image generation such as setting the overall +mood or cinematic appearance. To overcome these limitations, we propose to +condition the generation on shading and normal maps. We model the lighting with +single bounce shading, which includes cast shadows. We first train a shading +estimation module to generate a dataset of real-world images and shading pairs. +Then, we train a control network using the estimated shading and normals as +input. Our method demonstrates high-quality image generation and lighting +control in numerous scenes. Additionally, we use our generated dataset to train +an identity-preserving relighting model, conditioned on an image and a target +shading. 
Our method is the first that enables the generation of images with +controllable, consistent lighting and performs on par with specialized +relighting state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', 'I.4.8; I.2.10']" +Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping,Peng Sun · Xinyang Liu · Zhibo Wang · Bo Liu, ,,https://dl.acm.org/doi/abs/10.1145/3637494.3638729,,,,,nan +Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing,Bi'an Du · Xiang Gao · Wei Hu · Renjie Liao, ,https://arxiv.org/abs/2402.17464,,2402.17464.pdf,Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing,"Generative 3D part assembly involves understanding part relationships and +predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work +often focus on the geometry of individual parts, neglecting part-whole +hierarchies of objects. Leveraging two key observations: 1) super-part poses +provide strong hints about part poses, and 2) predicting super-part poses is +easier due to fewer superparts, we propose a part-whole-hierarchy message +passing network for efficient 3D part assembly. We first introduce super-parts +by grouping geometrically similar parts without any semantic labels. Then we +employ a part-whole hierarchical encoder, wherein a super-part encoder predicts +latent super-part poses based on input parts. Subsequently, we transform the +point cloud using the latent poses, feeding it to the part encoder for +aggregating super-part information and reasoning about part relationships to +predict all part poses. In training, only ground-truth part poses are required. +During inference, the predicted latent poses of super-parts enhance +interpretability. Experimental results on the PartNet dataset show that our +method achieves state-of-the-art performance in part and connectivity accuracy +and enables an interpretable hierarchical part assembly. Code is available at +https://github.com/pkudba/3DHPA.",cs.CV,['cs.CV'] +FreeMan: Towards benchmarking 3D human pose estimation under Real-World Conditions,Jiong WANG · Fengyu Yang · Bingliang Li · Wenbo Gou · Danqi Yan · Ailing Zeng · Ailing Zeng · Yijun Gao · Junle Wang · Yanqing Jing · Ruimao Zhang,https://wangjiongw.github.io/freeman/,https://arxiv.org/abs/2309.05073,,2309.05073.pdf,FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions,"Estimating the 3D structure of the human body from natural scenes is a +fundamental aspect of visual perception. 3D human pose estimation is a vital +step in advancing fields like AIGC and human-robot interaction, serving as a +crucial technique for understanding and interacting with human actions in +real-world settings. However, the current datasets, often collected under +single laboratory conditions using complex motion capture equipment and +unvarying backgrounds, are insufficient. The absence of datasets on variable +conditions is stalling the progress of this crucial task. To facilitate the +development of 3D pose estimation, we present FreeMan, the first large-scale, +multi-view dataset collected under the real-world conditions. FreeMan was +captured by synchronizing 8 smartphones across diverse scenarios. It comprises +11M frames from 8000 sequences, viewed from different perspectives. These +sequences cover 40 subjects across 10 different scenarios, each with varying +lighting conditions. 
We have also established an semi-automated pipeline +containing error detection to reduce the workload of manual check and ensure +precise annotation. We provide comprehensive evaluation baselines for a range +of tasks, underlining the significant challenges posed by FreeMan. Further +evaluations of standard indoor/outdoor human sensing datasets reveal that +FreeMan offers robust representation transferability in real and complex +scenes. Code and data are available at https://wangjiongw.github.io/freeman.",cs.CV,['cs.CV'] +Generative Multimodal Models are In-Context Learners,Quan Sun · Yufeng Cui · Yufeng Cui · Xiaosong Zhang · Fan Zhang · Qiying Yu · Yueze Wang · Yongming Rao · Jingjing Liu · Tiejun Huang · Xinlong Wang, ,https://arxiv.org/abs/2312.13286,,2312.13286.pdf,Generative Multimodal Models are In-Context Learners,"The human ability to easily solve multimodal tasks in context (i.e., with +only a few demonstrations or simple instructions), is what current multimodal +systems have largely struggled to imitate. In this work, we demonstrate that +the task-agnostic in-context learning capabilities of large multimodal models +can be significantly enhanced by effective scaling-up. We introduce Emu2, a +generative multimodal model with 37 billion parameters, trained on large-scale +multimodal sequences with a unified autoregressive objective. Emu2 exhibits +strong multimodal in-context learning abilities, even emerging to solve tasks +that require on-the-fly reasoning, such as visual prompting and object-grounded +generation. The model sets a new record on multiple multimodal understanding +tasks in few-shot settings. When instruction-tuned to follow specific +instructions, Emu2 further achieves new state-of-the-art on challenging tasks +such as question answering benchmarks for large multimodal models and +open-ended subject-driven generation. These achievements demonstrate that Emu2 +can serve as a base model and general-purpose interface for a wide range of +multimodal tasks. Code and models are publicly available to facilitate future +research.",cs.CV,['cs.CV'] +SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction,Yuan Li · Zhihao Liu · Bedrich Benes · Xiaopeng Zhang · Jianwei Guo,https://github.com/RyuZhihao123/SVDTree,https://arxiv.org/abs/2402.12712,,2402.12712.pdf,MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction,"This paper presents a neural architecture MVDiffusion++ for 3D object +reconstruction that synthesizes dense and high-resolution views of an object +given one or a few images without camera poses. MVDiffusion++ achieves superior +flexibility and scalability with two surprisingly simple ideas: 1) A +``pose-free architecture'' where standard self-attention among 2D latent +features learns 3D consistency across an arbitrary number of conditional and +generation views without explicitly using camera pose information; and 2) A +``view dropout strategy'' that discards a substantial number of output views +during training, which reduces the training-time memory footprint and enables +dense and high-resolution view synthesis at test time. We use the Objaverse for +training and the Google Scanned Objects for evaluation with standard novel view +synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly +outperforms the current state of the arts. We also demonstrate a text-to-3D +application example by combining MVDiffusion++ with a text-to-image generative +model. 
The project page is at https://mvdiffusion-plusplus.github.io.",cs.CV,['cs.CV'] +HomoFormer: Homogenized Transformer for Image Shadow Removal,Jie Xiao · Xueyang Fu · Yurui Zhu · Dong Li · Jie Huang · Kai Zhu · Zheng-Jun Zha, ,https://arxiv.org/abs/2404.18433,,2404.18433.pdf,ShadowMaskFormer: Mask Augmented Patch Embeddings for Shadow Removal,"Transformer recently emerged as the de facto model for computer vision tasks +and has also been successfully applied to shadow removal. However, these +existing methods heavily rely on intricate modifications to the attention +mechanisms within the transformer blocks while using a generic patch embedding. +As a result, it often leads to complex architectural designs requiring +additional computation resources. In this work, we aim to explore the efficacy +of incorporating shadow information within the early processing stage. +Accordingly, we propose a transformer-based framework with a novel patch +embedding that is tailored for shadow removal, dubbed ShadowMaskFormer. +Specifically, we present a simple and effective mask-augmented patch embedding +to integrate shadow information and promote the model's emphasis on acquiring +knowledge for shadow regions. Extensive experiments conducted on the ISTD, +ISTD+, and SRD benchmark datasets demonstrate the efficacy of our method +against state-of-the-art approaches while using fewer model parameters.",cs.CV,['cs.CV'] +Novel Class Discovery for Ultra-Fine-Grained Visual Categorization,Qi Jia · Yaqi Cai · Qi Jia · Binglin Qiu · Weimin Wang · Nan Pu,https://github.com/SSDUT-Caiyq/UFG-NCD,https://arxiv.org/abs/2405.06283,,2405.06283.pdf,Novel Class Discovery for Ultra-Fine-Grained Visual Categorization,"Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing +highly similar sub-categories within fine-grained objects, such as different +soybean cultivars. Compared to traditional fine-grained visual categorization, +Ultra-FGVC encounters more hurdles due to the small inter-class and large +intra-class variation. Given these challenges, relying on human annotation for +Ultra-FGVC is impractical. To this end, our work introduces a novel task termed +Ultra-Fine-Grained Novel Class Discovery (UFG-NCD), which leverages partially +annotated data to identify new categories of unlabeled images for Ultra-FGVC. +To tackle this problem, we devise a Region-Aligned Proxy Learning (RAPL) +framework, which comprises a Channel-wise Region Alignment (CRA) module and a +Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to +extract and utilize discriminative features from local regions, facilitating +knowledge transfer from labeled to unlabeled classes. Furthermore, SemiPL +strengthens representation learning and knowledge transfer with proxy-guided +supervised learning and proxy-guided contrastive learning. Such techniques +leverage class distribution information in the embedding space, improving the +mining of subtle differences between labeled and unlabeled ultra-fine-grained +classes. Extensive experiments demonstrate that RAPL significantly outperforms +baselines across various datasets, indicating its effectiveness in handling the +challenges of UFG-NCD. 
Code is available at +https://github.com/SSDUT-Caiyq/UFG-NCD.",cs.CV,['cs.CV'] +RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method,Ming Yan · Yan Zhang · Shuqiang Cai · Shuqi Fan · Xincheng Lin · Yudi Dai · Siqi Shen · Chenglu Wen · Lan Xu · Yuexin Ma · Cheng Wang,http://www.lidarhumanmotion.net/reli11d/,https://arxiv.org/abs/2403.19501,,2403.19501.pdf,RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method,"Comprehensive capturing of human motions requires both accurate captures of +complex poses and precise localization of the human within scenes. Most of the +HPE datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, +solely using these modalities or a combination of them may not be adequate for +HPE, particularly for complex and fast movements. For holistic human motion +understanding, we present RELI11D, a high-quality multimodal human motion +dataset involves LiDAR, IMU system, RGB camera, and Event camera. It records +the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours +of synchronized LiDAR point clouds, IMU measurement data, RGB videos and Event +steams. Through extensive experiments, we demonstrate that the RELI11D presents +considerable challenges and opportunities as it contains many rapid and complex +motions that require precise location. To address the challenge of integrating +different modalities, we propose LEIR, a multimodal baseline that effectively +utilizes LiDAR Point Cloud, Event stream, and RGB through our cross-attention +fusion strategy. We show that LEIR exhibits promising results for rapid motions +and daily motions and that utilizing the characteristics of multiple modalities +can indeed improve HPE performance. Both the dataset and source code will be +released publicly to the research community, fostering collaboration and +enabling further exploration in this field.",cs.CV,['cs.CV'] +Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives,Ronghui Li · Yuxiang Zhang · Yachao Zhang · Hongwen Zhang · Jie Guo · Yan Zhang · Yebin Liu · Xiu Li,https://li-ronghui.github.io/lodge,https://arxiv.org/abs/2403.10518,,2403.10518.pdf,Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives,"We propose Lodge, a network capable of generating extremely long dance +sequences conditioned on given music. We design Lodge as a two-stage coarse to +fine diffusion architecture, and propose the characteristic dance primitives +that possess significant expressiveness as intermediate representations between +two diffusion models. The first stage is global diffusion, which focuses on +comprehending the coarse-level music-dance correlation and production +characteristic dance primitives. In contrast, the second-stage is the local +diffusion, which parallelly generates detailed motion sequences under the +guidance of the dance primitives and choreographic rules. In addition, we +propose a Foot Refine Block to optimize the contact between the feet and the +ground, enhancing the physical realism of the motion. Our approach can +parallelly generate dance sequences of extremely long length, striking a +balance between global choreographic patterns and local motion quality and +expressiveness. 
Extensive experiments validate the efficacy of our method.",cs.CV,"['cs.CV', 'cs.GR', 'cs.SD', 'eess.AS']" +ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models,Meng-Li Shih · Wei-Chiu Ma · Lorenzo Boyice · Aleksander Holynski · Forrester Cole · Brian Curless · Janne Kontkanen, ,https://arxiv.org/abs/2401.00979,,2401.00979.pdf,3D Visibility-aware Generalizable Neural Radiance Fields for Interacting Hands,"Neural radiance fields (NeRFs) are promising 3D representations for scenes, +objects, and humans. However, most existing methods require multi-view inputs +and per-scene training, which limits their real-life applications. Moreover, +current methods focus on single-subject cases, leaving scenes of interacting +hands that involve severe inter-hand occlusions and challenging view variations +remain unsolved. To tackle these issues, this paper proposes a generalizable +visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, +given an image of interacting hands as input, our VA-NeRF first obtains a +mesh-based representation of hands and extracts their corresponding geometric +and textural features. Subsequently, a feature fusion module that exploits the +visibility of query points and mesh vertices is introduced to adaptively merge +features of both hands, enabling the recovery of features in unseen areas. +Additionally, our VA-NeRF is optimized together with a novel discriminator +within an adversarial learning paradigm. In contrast to conventional +discriminators that predict a single real/fake label for the synthesized image, +the proposed discriminator generates a pixel-wise visibility map, providing +fine-grained supervision for unseen areas and encouraging the VA-NeRF to +improve the visual quality of synthesized images. Experiments on the +Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms +conventional NeRFs significantly. Project Page: +\url{https://github.com/XuanHuang0/VANeRF}.",cs.CV,['cs.CV'] +"Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision",Mohammad Reza Hosseinzadeh Taher · Michael Gotway · Jianming Liang, ,https://arxiv.org/abs/2404.15672,,2404.15672.pdf,"Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision","Humans effortlessly interpret images by parsing them into part-whole +hierarchies; deep learning excels in learning multi-level feature spaces, but +they often lack explicit coding of part-whole relations, a prominent property +of medical imaging. To overcome this limitation, we introduce Adam-v2, a new +self-supervised learning framework extending Adam [79] by explicitly +incorporating part-whole hierarchies into its learning objectives through three +key branches: (1) Localizability, acquiring discriminative representations to +distinguish different anatomical patterns; (2) Composability, learning each +anatomical structure in a parts-to-whole manner; and (3) Decomposability, +comprehending each anatomical structure in a whole-to-parts manner. +Experimental results across 10 tasks, compared to 11 baselines in zero-shot, +few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior +performance over large-scale medical models and existing SSL methods across +diverse downstream tasks. 
The higher generality and robustness of Adam-v2's +representations originate from its explicit construction of hierarchies for +distinct anatomical structures from unlabeled medical images. Adam-v2 preserves +a semantic balance of anatomical diversity and harmony in its embedding, +yielding representations that are both generic and semantically meaningful, yet +overlooked in existing SSL methods. All code and pretrained models are +available at https://github.com/JLiangLab/Eden.",cs.CV,['cs.CV'] +Robust Synthetic-to-Real Transfer for Stereo Matching,Jiawei Zhang · Jiahe Li · Lei Huang · Xiaohan Yu · Lin Gu · Jin Zheng · Xiao Bai, ,https://arxiv.org/abs/2403.07705,,2403.07705.pdf,Robust Synthetic-to-Real Transfer for Stereo Matching,"With advancements in domain generalized stereo matching networks, models +pre-trained on synthetic data demonstrate strong robustness to unseen domains. +However, few studies have investigated the robustness after fine-tuning them in +real-world scenarios, during which the domain generalization ability can be +seriously degraded. In this paper, we explore fine-tuning stereo matching +networks without compromising their robustness to unseen domains. Our +motivation stems from comparing Ground Truth (GT) versus Pseudo Label (PL) for +fine-tuning: GT degrades, but PL preserves the domain generalization ability. +Empirically, we find the difference between GT and PL implies valuable +information that can regularize networks during fine-tuning. We also propose a +framework to utilize this difference for fine-tuning, consisting of a frozen +Teacher, an exponential moving average (EMA) Teacher, and a Student network. +The core idea is to utilize the EMA Teacher to measure what the Student has +learned and dynamically improve GT and PL for fine-tuning. We integrate our +framework with state-of-the-art networks and evaluate its effectiveness on +several real-world datasets. Extensive experiments show that our method +effectively preserves the domain generalization ability during fine-tuning.",cs.CV,['cs.CV'] +Towards Robust Learning to Optimize with Theoretical Guarantees,Qingyu Song · Wei Lin · Juncheng Wang · Hong Xu,https://github.com/NetX-lab/GoMathL2O-Official,,https://henryhxu.github.io/papers.html,,,,,nan +UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs,Yanwu Xu · Yang Zhao · Zhisheng Xiao · Tingbo Hou, ,https://arxiv.org/abs/2311.09257,,2311.09257.pdf,UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs,"Text-to-image diffusion models have demonstrated remarkable capabilities in +transforming textual prompts into coherent images, yet the computational cost +of their inference remains a persistent challenge. To address this issue, we +present UFOGen, a novel generative model designed for ultra-fast, one-step +text-to-image synthesis. In contrast to conventional approaches that focus on +improving samplers or employing distillation techniques for diffusion models, +UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN +objective. Leveraging a newly introduced diffusion-GAN objective and +initialization with pre-trained diffusion models, UFOGen excels in efficiently +generating high-quality images conditioned on textual descriptions in a single +step. Beyond traditional text-to-image generation, UFOGen showcases versatility +in applications. 
Notably, UFOGen stands among the pioneering models enabling +one-step text-to-image generation and diverse downstream tasks, presenting a +significant advancement in the landscape of efficient generative models.",cs.CV,['cs.CV'] +Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction,Mi-Gyeong Gwon · Gi-Mun Um · Won-Sik Cheong · Wonjun Kim,https://github.com/DCVL-3D/InstanceHMR_release,https://arxiv.org/abs/2307.16377,,2307.16377.pdf,JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery,"In this study, we focus on the problem of 3D human mesh recovery from a +single image under obscured conditions. Most state-of-the-art methods aim to +improve 2D alignment technologies, such as spatial averaging and 2D joint +sampling. However, they tend to neglect the crucial aspect of 3D alignment by +improving 3D representations. Furthermore, recent methods struggle to separate +the target human from occlusion or background in crowded scenes as they +optimize the 3D space of target human with 3D joint coordinates as local +supervision. To address these issues, a desirable method would involve a +framework for fusing 2D and 3D features and a strategy for optimizing the 3D +space globally. Therefore, this paper presents 3D JOint contrastive learning +with TRansformers (JOTR) framework for handling occluded 3D human mesh +recovery. Our method includes an encoder-decoder transformer architecture to +fuse 2D and 3D representations for achieving 2D$\&$3D aligned results in a +coarse-to-fine manner and a novel 3D joint contrastive learning approach for +adding explicitly global supervision for the 3D feature space. The contrastive +learning approach includes two contrastive losses: joint-to-joint contrast for +enhancing the similarity of semantically similar voxels (i.e., human joints), +and joint-to-non-joint contrast for ensuring discrimination from others (e.g., +occlusions and background). Qualitative and quantitative analyses demonstrate +that our method outperforms state-of-the-art competitors on both +occlusion-specific and standard benchmarks, significantly improving the +reconstruction of occluded humans.",cs.CV,['cs.CV'] +CCEdit: Creative and Controllable Video Editing via Diffusion Models,Ruoyu Feng · Wenming Weng · Yanhui Wang · Yuhui Yuan · Jianmin Bao · Chong Luo · Zhibo Chen · Baining Guo, ,https://arxiv.org/abs/2309.16496,,2309.16496.pdf,CCEdit: Creative and Controllable Video Editing via Diffusion Models,"In this paper, we present CCEdit, a versatile generative video editing +framework based on diffusion models. Our approach employs a novel trident +network structure that separates structure and appearance control, ensuring +precise and creative editing capabilities. Utilizing the foundational +ControlNet architecture, we maintain the structural integrity of the video +during editing. The incorporation of an additional appearance branch enables +users to exert fine-grained control over the edited key frame. These two side +branches seamlessly integrate into the main branch, which is constructed upon +existing text-to-image (T2I) generation models, through learnable temporal +layers. The versatility of our framework is demonstrated through a diverse +range of choices in both structure representations and personalized T2I models, +as well as the option to provide the edited key frame. To facilitate +comprehensive evaluation, we introduce the BalanceCC benchmark dataset, +comprising 100 videos and 4 target prompts for each video. 
Our extensive user +studies compare CCEdit with eight state-of-the-art video editing methods. The +outcomes demonstrate CCEdit's substantial superiority over all other methods.",cs.CV,['cs.CV'] +CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images,Guanlin Shen · Jingwei Huang · Zhihua Hu · Bin Wang,https://github.com/SerCharles/CN-RMA,https://arxiv.org/abs/2403.04198,,2403.04198.pdf,CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images,"This paper introduces CN-RMA, a novel approach for 3D indoor object detection +from multi-view images. We observe the key challenge as the ambiguity of image +and 3D correspondence without explicit geometry to provide occlusion +information. To address this issue, CN-RMA leverages the synergy of 3D +reconstruction networks and 3D object detection networks, where the +reconstruction network provides a rough Truncated Signed Distance Function +(TSDF) and guides image features to vote to 3D space correctly in an end-to-end +manner. Specifically, we associate weights to sampled points of each ray +through ray marching, representing the contribution of a pixel in an image to +corresponding 3D locations. Such weights are determined by the predicted signed +distances so that image features vote only to regions near the reconstructed +surface. Our method achieves state-of-the-art performance in 3D object +detection from multi-view images, as measured by mAP@0.25 and mAP@0.5 on the +ScanNet and ARKitScenes datasets. The code and models are released at +https://github.com/SerCharles/CN-RMA.",cs.CV,['cs.CV'] +One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning,Pei-Kai Huang · Cheng-Hsuan Chiang · Tzu-Hsien Chen · Jun-Xiong Chong · Tyng-Luh Liu · Chiou-Ting Hsu, ,,https://link.springer.com/article/10.1007/s11042-023-17739-y,,,,,nan +PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video,Dong Wu · Zike Yan · Hongbin Zha, ,,,,,,,nan +Grounding Everything: Emerging Localization Properties in Vision-Language Transformers,Walid Bousselham · Felix Petersen · Vittorio Ferrari · Hilde Kuehne, ,https://arxiv.org/abs/2312.00878,,2312.00878.pdf,Grounding Everything: Emerging Localization Properties in Vision-Language Transformers,"Vision-language foundation models have shown remarkable performance in +various zero-shot settings such as image retrieval, classification, or +captioning. But so far, those models seem to fall behind when it comes to +zero-shot localization of referential expressions and objects in images. As a +result, they need to be fine-tuned for this task. In this paper, we show that +pretrained vision-language (VL) models allow for zero-shot open-vocabulary +object localization without any fine-tuning. To leverage those capabilities, we +propose a Grounding Everything Module (GEM) that generalizes the idea of +value-value attention introduced by CLIPSurgery to a self-self attention path. +We show that the concept of self-self attention corresponds to clustering, thus +enforcing groups of tokens arising from the same object to be similar while +preserving the alignment with the language space. To further guide the group +formation, we propose a set of regularizations that allows the model to finally +generalize across datasets and backbones. We evaluate the proposed GEM +framework on various benchmark tasks and datasets for semantic segmentation. 
It +shows that GEM not only outperforms other training-free open-vocabulary +localization methods, but also achieves state-of-the-art results on the +recently proposed OpenImagesV7 large-scale segmentation benchmark.",cs.CV,"['cs.CV', 'cs.AI']" +Brain Decodes Deep Nets,Huzheng Yang · James Gee · Jianbo Shi,https://huzeyann.github.io/brain-decodes-deep-nets,https://arxiv.org/abs/2312.01280,,2312.01280.pdf,Brain Decodes Deep Nets,"We developed a tool for visualizing and analyzing large pre-trained vision +models by mapping them onto the brain, thus exposing their hidden inside. Our +innovation arises from a surprising usage of brain encoding: predicting brain +fMRI measurements in response to images. We report two findings. First, +explicit mapping between the brain and deep-network features across dimensions +of space, layers, scales, and channels is crucial. This mapping method, +FactorTopy, is plug-and-play for any deep-network; with it, one can paint a +picture of the network onto the brain (literally!). Second, our visualization +shows how different training methods matter: they lead to remarkable +differences in hierarchical organization and scaling behavior, growing with +more data or network capacity. It also provides insight into fine-tuning: how +pre-trained models change when adapting to small datasets. We found brain-like +hierarchically organized network suffer less from catastrophic forgetting after +fine-tuned.",cs.CV,['cs.CV'] +Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion,Jiangtong Tan · Jie Huang · Naishan Zheng · Man Zhou · Keyu Yan · Danfeng Hong · Feng Zhao, ,,https://ieeexplore.ieee.org/document/10443302,,,,,nan +Prompting Vision Foundation Models for Pathology Image Analysis,CHONG YIN · Siqi Liu · Kaiyang Zhou · Vincent Wong · Pong C. Yuen, ,https://arxiv.org/abs/2403.16497,,2403.16497.pdf,PathoTune: Adapting Visual Foundation Model to Pathological Specialists,"As natural image understanding moves towards the pretrain-finetune era, +research in pathology imaging is concurrently evolving. Despite the predominant +focus on pretraining pathological foundation models, how to adapt foundation +models to downstream tasks is little explored. For downstream adaptation, we +propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the +Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework +designed to efficiently adapt pathological or even visual foundation models to +pathology-specific tasks via multi-modal prompt tuning. The proposed framework +leverages Task-specific Visual Prompts and Task-specific Textual Prompts to +identify task-relevant features, along with Instance-specific Visual Prompts +for encoding single pathological image features. Results across multiple +datasets at both patch-level and WSI-level demonstrate its superior performance +over single-modality prompt tuning approaches. Significantly, PathoTune +facilitates the direct adaptation of natural visual foundation models to +pathological tasks, drastically outperforming pathological foundation models +with simple linear probing. 
The code will be available upon acceptance.",cs.CV,"['cs.CV', 'cs.LG']" +DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization,Jiahe Li · Jiawei Zhang · Xiao Bai · Jin Zheng · Xin Ning · Jun Zhou · Lin Gu,https://fictionarry.github.io/DNGaussian/,https://arxiv.org/abs/2403.06912,,2403.06912.pdf,DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization,"Radiance fields have demonstrated impressive performance in synthesizing +novel views from sparse input views, yet prevailing methods suffer from high +training costs and slow inference speed. This paper introduces DNGaussian, a +depth-regularized framework based on 3D Gaussian radiance fields, offering +real-time and high-quality few-shot novel view synthesis at low costs. Our +motivation stems from the highly efficient representation and surprising +quality of the recent 3D Gaussian Splatting, despite it will encounter a +geometry degradation when input views decrease. In the Gaussian radiance +fields, we find this degradation in scene geometry primarily lined to the +positioning of Gaussian primitives and can be mitigated by depth constraint. +Consequently, we propose a Hard and Soft Depth Regularization to restore +accurate scene geometry under coarse monocular depth supervision while +maintaining a fine-grained color appearance. To further refine detailed +geometry reshaping, we introduce Global-Local Depth Normalization, enhancing +the focus on small local depth changes. Extensive experiments on LLFF, DTU, and +Blender datasets demonstrate that DNGaussian outperforms state-of-the-art +methods, achieving comparable or better results with significantly reduced +memory cost, a $25 \times$ reduction in training time, and over $3000 \times$ +faster rendering speed.",cs.CV,['cs.CV'] +Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,Xiao Ma · Sumit Patidar · Iain Haughton · Stephen James,https://yusufma03.github.io/projects/hdp/,https://arxiv.org/abs/2403.03890v1,,2403.03890v1.pdf,Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,"This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical +agent for multi-task robotic manipulation. HDP factorises a manipulation policy +into a hierarchical structure: a high-level task-planning agent which predicts +a distant next-best end-effector pose (NBP), and a low-level goal-conditioned +diffusion policy which generates optimal motion trajectories. The factorised +policy representation allows HDP to tackle both long-horizon task planning +while generating fine-grained low-level actions. To generate context-aware +motion trajectories while satisfying robot kinematics constraints, we present a +novel kinematics-aware goal-conditioned control agent, Robot Kinematics +Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the +end-effector pose and joint position trajectories, and distill the accurate but +kinematics-unaware end-effector pose diffuser to the kinematics-aware but less +accurate joint position diffuser via differentiable kinematics. 
Empirically, we +show that HDP achieves a significantly higher success rate than the +state-of-the-art methods in both simulation and real-world.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,Yifei Chen · Dapeng Chen · Ruijin Liu · Sai Zhou · Wenyuan Xue · Wei Peng, ,https://arxiv.org/abs/2311.15619,,2311.15619.pdf,Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,"Large-scale visual-language pre-trained models have achieved significant +success in various video tasks. However, most existing methods follow an ""adapt +then align"" paradigm, which adapts pre-trained image encoders to model +video-level representations and utilizes one-hot or text embedding of the +action labels for supervision. This paradigm overlooks the challenge of mapping +from static images to complicated activity concepts. In this paper, we propose +a novel ""Align before Adapt"" (ALT) paradigm. Prior to adapting to video +representation learning, we exploit the entity-to-region alignments for each +frame. The alignments are fulfilled by matching the region-aware image +embeddings to an offline-constructed text corpus. With the aligned entities, we +feed their text embeddings to a transformer-based video adapter as the queries, +which can help extract the semantics of the most important entities from a +video to a vector. This paradigm reuses the visual-language alignment of VLP +during adaptation and tries to explain an action by the underlying entities. +This helps understand actions by bridging the gap with complex activity +semantics, particularly when facing unfamiliar or unseen categories. ALT +demonstrates competitive performance while maintaining remarkably low +computational costs. In fully supervised experiments, it achieves 88.1% top-1 +accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover, ALT outperforms the +previous state-of-the-art methods in both zero-shot and few-shot experiments, +emphasizing its superior generalizability across various learning scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Discriminative Probing and Tuning for Text-to-Image Generation,Leigang Qu · Wenjie Wang · Yongqi Li · Hanwang Zhang · Liqiang Nie · Tat-seng Chua,https://dpt-t2i.github.io/,https://arxiv.org/abs/2403.04321,,2403.04321.pdf,Discriminative Probing and Tuning for Text-to-Image Generation,"Despite advancements in text-to-image generation (T2I), prior methods often +face text-image misalignment problems such as relation confusion in generated +images. Existing solutions involve cross-attention manipulation for better +compositional understanding or integrating large language models for improved +layout planning. However, the inherent alignment capabilities of T2I models are +still inadequate. By reviewing the link between generative and discriminative +modeling, we posit that T2I models' discriminative abilities may reflect their +text-image alignment proficiency during generation. In this light, we advocate +bolstering the discriminative abilities of T2I models to achieve more precise +text-to-image alignment for generation. We present a discriminative adapter +built on T2I models to probe their discriminative abilities on two +representative tasks and leverage discriminative fine-tuning to improve their +text-image alignment. 
As a bonus of the discriminative adapter, a +self-correction mechanism can leverage discriminative gradients to better align +generated images to text prompts during inference. Comprehensive evaluations +across three benchmark datasets, including both in-distribution and +out-of-distribution scenarios, demonstrate our method's superior generation +performance. Meanwhile, it achieves state-of-the-art discriminative performance +on the two discriminative tasks compared to other generative models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.MM']" +CLIP-KD: An Empirical Study of CLIP Model Distillation,Chuanguang Yang · Zhulin An · Libo Huang · Junyu Bi · XinQiang Yu · Han Yang · boyu diao · Yongjun Xu, ,https://arxiv.org/abs/2307.12732,,2307.12732.pdf,CLIP-KD: An Empirical Study of CLIP Model Distillation,"Contrastive Language-Image Pre-training (CLIP) has become a promising +language-supervised visual pre-training framework. This paper aims to distill +small CLIP models supervised by a large teacher CLIP model. We propose several +distillation strategies, including relation, feature, gradient and contrastive +paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We +show that a simple feature mimicry with Mean Squared Error loss works +surprisingly well. Moreover, interactive contrastive learning across teacher +and student encoders is also effective in performance improvement. We explain +that the success of CLIP-KD can be attributed to maximizing the feature +similarity between teacher and student. The unified method is applied to +distill several student models trained on CC3M+12M. CLIP-KD improves student +CLIP models consistently over zero-shot ImageNet classification and cross-modal +retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the +teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy +over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\% +and 20.1\% margins, respectively. Our code is released on +https://github.com/winycg/CLIP-KD.",cs.CV,['cs.CV'] +FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation,Shuai Yang · Yifan Zhou · Ziwei Liu · Chen Change Loy,https://www.mmlab-ntu.com/project/fresco/,https://arxiv.org/abs/2403.12962,,2403.12962.pdf,FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation,"The remarkable efficacy of text-to-image diffusion models has motivated +extensive exploration of their potential application in video domains. +Zero-shot methods seek to extend image diffusion models to videos without +necessitating model training. Recent methods mainly focus on incorporating +inter-frame correspondence into attention mechanisms. However, the soft +constraint imposed on determining where to attend to valid features can +sometimes be insufficient, resulting in temporal inconsistency. In this paper, +we introduce FRESCO, intra-frame correspondence alongside inter-frame +correspondence to establish a more robust spatial-temporal constraint. This +enhancement ensures a more consistent transformation of semantically similar +content across frames. Beyond mere attention guidance, our approach involves an +explicit update of features to achieve high spatial-temporal consistency with +the input video, significantly improving the visual coherence of the resulting +translated videos. 
Extensive experiments demonstrate the effectiveness of our +proposed framework in producing high-quality, coherent videos, marking a +notable improvement over existing zero-shot methods.",cs.CV,['cs.CV'] +XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images,CHONG YIN · Siqi Liu · Fei Lyu · Jiahao Lu · Sune Darkner · Vincent Wong · Pong C. Yuen, ,,https://www.youtube.com/watch?v=_Yiu5g71ZHo,,,,,nan +LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model,Dongkai Wang · shiyu xuan · Shiliang Zhang, ,https://arxiv.org/abs/2310.00582,,2310.00582.pdf,Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities +in various multi-modal tasks. Nevertheless, their performance in fine-grained +image understanding tasks is still limited. To address this issue, this paper +proposes a new framework to enhance the fine-grained image understanding +abilities of MLLMs. Specifically, we present a new method for constructing the +instruction tuning dataset at a low cost by leveraging annotations in existing +datasets. A self-consistent bootstrapping method is also introduced to extend +existing dense object annotations into high-quality +referring-expression-bounding-box pairs. These methods enable the generation of +high-quality instruction data which includes a wide range of fundamental +abilities essential for fine-grained image perception. Moreover, we argue that +the visual encoder should be tuned during instruction tuning to mitigate the +gap between full image perception and fine-grained image perception. +Experimental results demonstrate the superior performance of our method. For +instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA +and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also +attained the top rank on the leaderboard of MMBench. This promising performance +is achieved by training on only publicly available data, making it easily +reproducible. The models, datasets, and codes are publicly available at +https://github.com/SY-Xuan/Pink.",cs.CV,"['cs.CV', 'cs.AI']" +Producing and Leveraging Online Map Uncertainty in Trajectory Prediction,Xunjiang Gu · Guanyu Song · Igor Gilitschenski · Marco Pavone · Boris Ivanovic,https://github.com/alfredgu001324/MapUncertaintyPrediction,https://arxiv.org/abs/2403.16439v1,,2403.16439v1.pdf,Producing and Leveraging Online Map Uncertainty in Trajectory Prediction,"High-definition (HD) maps have played an integral role in the development of +modern autonomous vehicle (AV) stacks, albeit with high associated labeling and +maintenance costs. As a result, many recent works have proposed methods for +estimating HD maps online from sensor data, enabling AVs to operate outside of +previously-mapped regions. However, current online map estimation approaches +are developed in isolation of their downstream tasks, complicating their +integration in AV stacks. In particular, they do not produce uncertainty or +confidence estimates. In this work, we extend multiple state-of-the-art online +map estimation methods to additionally estimate uncertainty and show how this +enables more tightly integrating online mapping with trajectory forecasting. 
In +doing so, we find that incorporating uncertainty yields up to 50% faster +training convergence and up to 15% better prediction performance on the +real-world nuScenes driving dataset.",cs.RO,"['cs.RO', 'cs.CV', 'cs.LG']" +GenZI: Zero-Shot 3D Human-Scene Interaction Generation,Lei Li · Angela Dai,https://craigleili.github.io/projects/genzi/,https://arxiv.org/abs/2311.17737,,2311.17737.pdf,GenZI: Zero-Shot 3D Human-Scene Interaction Generation,"Can we synthesize 3D humans interacting with scenes without learning from any +3D human-scene interaction data? We propose GenZI, the first zero-shot approach +to generating 3D human-scene interactions. Key to GenZI is our distillation of +interaction priors from large vision-language models (VLMs), which have learned +a rich semantic space of 2D human-scene compositions. Given a natural language +description and a coarse point location of the desired interaction in a 3D +scene, we first leverage VLMs to imagine plausible 2D human interactions +inpainted into multiple rendered views of the scene. We then formulate a robust +iterative optimization to synthesize the pose and shape of a 3D human model in +the scene, guided by consistency with the 2D interaction hypotheses. In +contrast to existing learning-based approaches, GenZI circumvents the +conventional need for captured 3D interaction data, and allows for flexible +control of the 3D interaction synthesis with easy-to-use text prompts. +Extensive experiments show that our zero-shot approach has high flexibility and +generality, making it applicable to diverse scene types, including both indoor +and outdoor environments.",cs.CV,"['cs.CV', 'cs.GR']" +LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging,Haoyang Ge · Qiao Feng · Hailong Jia · Xiongzheng Li · Xiangjun Yin · You Zhou · Jingyu Yang · Kun Li,https://cic.tju.edu.cn/faculty/likun/projects/LPSNet/index.html,https://arxiv.org/abs/2404.01941,,2404.01941.pdf,LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging,"Human pose and shape (HPS) estimation with lensless imaging is not only +beneficial to privacy protection but also can be used in covert surveillance +scenarios due to the small size and simple structure of this device. However, +this task presents significant challenges due to the inherent ambiguity of the +captured measurements and lacks effective methods for directly estimating human +pose and shape from lensless data. In this paper, we propose the first +end-to-end framework to recover 3D human poses and shapes from lensless +measurements to our knowledge. We specifically design a multi-scale lensless +feature decoder to decode the lensless measurements through the optically +encoded mask for efficient feature extraction. We also propose a double-head +auxiliary supervision mechanism to improve the estimation accuracy of human +limb ends. Besides, we establish a lensless imaging system and verify the +effectiveness of our method on various datasets acquired by our lensless +imaging system.",cs.CV,['cs.CV'] +Can I Trust Your Answer? Visually Grounded Video Question Answering,Junbin Xiao · Angela Yao · Yicong Li · Tat-seng Chua, ,https://arxiv.org/abs/2309.01327,,2309.01327.pdf,Can I Trust Your Answer? Visually Grounded Video Question Answering,"We study visually grounded VideoQA in response to the emerging trends of +utilizing pretraining techniques for video-language understanding. 
+Specifically, by forcing vision-language models (VLMs) to answer questions and +simultaneously provide visual evidence, we seek to ascertain the extent to +which the predictions of such techniques are genuinely anchored in relevant +video content, versus spurious correlations from language or irrelevant visual +context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with +10.5$K$ temporal grounding (or location) labels tied to the original QA pairs. +With NExT-GQA, we scrutinize a series of state-of-the-art VLMs. Through +post-hoc attention analysis, we find that these models are extremely weak in +substantiating the answers despite their strong QA performance. This exposes +the limitation of current VLMs in making reliable predictions. As a remedy, we +further explore and propose a grounded-QA method via Gaussian mask optimization +and cross-modal learning. Experiments with different backbones demonstrate that +this grounding mechanism improves both grounding and QA. With these efforts, we +aim to push towards trustworthy VLMs in VQA systems. Our dataset and code are +available at https://github.com/doc-doc/NExT-GQA.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction,Baptiste Brument · Robin Bruneau · Yvain Queau · Jean Mélou · Francois Lauze · Jean-Denis Durou · Lilian Calvet,https://robinbruneau.github.io/publications/rnb_neus.html,https://arxiv.org/abs/2312.01215,,2312.01215.pdf,RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction,"This paper introduces a versatile paradigm for integrating multi-view +reflectance (optional) and normal maps acquired through photometric stereo. Our +approach employs a pixel-wise joint re-parameterization of reflectance and +normal, considering them as a vector of radiances rendered under simulated, +varying illumination. This re-parameterization enables the seamless integration +of reflectance and normal maps as input data in neural volume rendering-based +3D reconstruction while preserving a single optimization objective. In +contrast, recent multi-view photometric stereo (MVPS) methods depend on +multiple, potentially conflicting objectives. Despite its apparent simplicity, +our proposed approach outperforms state-of-the-art approaches in MVPS +benchmarks across F-score, Chamfer distance, and mean angular error metrics. +Notably, it significantly improves the detailed 3D reconstruction of areas with +high curvature or low visibility.",cs.CV,['cs.CV'] +Multimodal Sense-Informed Prediction of 3D Human Motions,Zhenyu Lou · Qiongjie Cui · Haofan Wang · Xu Tang · Hong Zhou, ,https://arxiv.org/abs/2405.02911,,2405.02911.pdf,Multimodal Sense-Informed Prediction of 3D Human Motions,"Predicting future human pose is a fundamental application for machine +intelligence, which drives robots to plan their behavior and paths ahead of +time to seamlessly accomplish human-robot collaboration in real-world 3D +scenarios. Despite encouraging results, existing approaches rarely consider the +effects of the external scene on the motion sequence, leading to pronounced +artifacts and physical implausibilities in the predictions. To address this +limitation, this work introduces a novel multi-modal sense-informed motion +prediction approach, which conditions high-fidelity generation on two modal +information: external 3D scene, and internal human gaze, and is able to +recognize their salience for future human activity. 
Furthermore, the gaze +information is regarded as the human intention, and combined with both motion +and scene features, we construct a ternary intention-aware attention to +supervise the generation to match where the human wants to reach. Meanwhile, we +introduce semantic coherence-aware attention to explicitly distinguish the +salient point clouds and the underlying ones, to ensure a reasonable +interaction of the generated sequence with the 3D scene. On two real-world +benchmarks, the proposed method achieves state-of-the-art performance both in +3D human pose and trajectory prediction.",cs.CV,['cs.CV'] +SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge,Andong Wang · Bo Wu · Sunli Chen · Zhenfang Chen · Haotian Guan · Wei-Ning Lee · Li Erran Li · Chuang Gan, ,https://arxiv.org/abs/2405.09713,,2405.09713.pdf,SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge,"Learning commonsense reasoning from visual contexts and scenes in real-world +is a crucial step toward advanced artificial intelligence. However, existing +video reasoning benchmarks are still inadequate since they were mainly designed +for factual or situated reasoning and rarely involve broader knowledge in the +real world. Our work aims to delve deeper into reasoning evaluations, +specifically within dynamic, open-world, and structured context knowledge. We +propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K +situations with instance-level annotations depicted in the videos. The +reasoning process is required to understand and apply situated knowledge and +general knowledge for problem-solving. To create such a dataset, we propose an +automatic and scalable generation method to generate question-answer pairs, +knowledge graphs, and rationales by instructing the combinations of LLMs and +MLLMs. Concretely, we first extract observable situated entities, relations, +and processes from videos for situated knowledge and then extend to open-world +knowledge beyond the visible content. The task generation is facilitated +through multiple dialogues as iterations and subsequently corrected and refined +by our designed self-promptings and demonstrations. With a corpus of both +explicit situated facts and implicit commonsense, we generate associated +question-answer pairs and reasoning processes, finally followed by manual +reviews for quality assurance. We evaluated recent mainstream large +vision-language models on the benchmark and found several insightful +conclusions. For more information, please refer to our benchmark at +www.bobbywu.com/SOKBench.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SURE: SUrvey REcipes for building reliable and robust deep networks,Yuting Li · Yingyi Chen · Xuanlong Yu · Dexiong Chen · Xi Shen,https://yutingli0606.github.io/SURE/,https://arxiv.org/abs/2403.00543,,2403.00543.pdf,SURE: SUrvey REcipes for building reliable and robust deep networks,"In this paper, we revisit techniques for uncertainty estimation within deep +neural networks and consolidate a suite of techniques to enhance their +reliability. Our investigation reveals that an integrated application of +diverse techniques--spanning model regularization, classifier and +optimization--substantially improves the accuracy of uncertainty predictions in +image classification tasks. The synergistic effect of these techniques +culminates in our novel SURE approach. 
We rigorously evaluate SURE against the +benchmark of failure prediction, a critical testbed for uncertainty estimation +efficacy. Our results showcase a consistently better performance than models +that individually deploy each technique, across various datasets and model +architectures. When applied to real-world challenges, such as data corruption, +label noise, and long-tailed class distribution, SURE exhibits remarkable +robustness, delivering results that are superior or on par with current +state-of-the-art specialized methods. Particularly on Animal-10N and Food-101N +for learning with noisy labels, SURE achieves state-of-the-art performance +without any task-specific adjustments. This work not only sets a new benchmark +for robust uncertainty estimation but also paves the way for its application in +diverse, real-world scenarios where reliability is paramount. Our code is +available at \url{https://yutingli0606.github.io/SURE/}.",cs.CV,['cs.CV'] +"ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation",Yan Di · Chenyangguang Zhang · Chaowei Wang · Ruida Zhang · Guangyao Zhai · Yanyan Li · Bowen Fu · Xiangyang Ji · Shan Gao, ,https://arxiv.org/abs/2311.11106,,2311.11106.pdf,"ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation","In this paper, we present ShapeMatcher, a unified self-supervised learning +framework for joint shape canonicalization, segmentation, retrieval and +deformation. Given a partially-observed object in an arbitrary pose, we first +canonicalize the object by extracting point-wise affine-invariant features, +disentangling inherent structure of the object with its pose and size. These +learned features are then leveraged to predict semantically consistent part +segmentation and corresponding part centers. Next, our lightweight retrieval +module aggregates the features within each part as its retrieval token and +compare all the tokens with source shapes from a pre-established database to +identify the most geometrically similar shape. Finally, we deform the retrieved +shape in the deformation module to tightly fit the input object by harnessing +part center guided neural cage deformation. The key insight of ShapeMaker is +the simultaneous training of the four highly-associated processes: +canonicalization, segmentation, retrieval, and deformation, leveraging +cross-task consistency losses for mutual supervision. Extensive experiments on +synthetic datasets PartNet, ComplementMe, and real-world dataset Scan2CAD +demonstrate that ShapeMaker surpasses competitors by a large margin.",cs.CV,['cs.CV'] +DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars,Tobias Kirschstein · Simon Giebenhain · Matthias Nießner,https://tobias-kirschstein.github.io/diffusion-avatars/,https://arxiv.org/abs/2311.18635,,2311.18635.pdf,DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars,"DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, +offering intuitive control over both pose and expression. We propose a +diffusion-based neural renderer that leverages generic 2D priors to produce +compelling images of faces. For coarse guidance of the expression and head +pose, we render a neural parametric head model (NPHM) from the target +viewpoint, which acts as a proxy geometry of the person. 
Additionally, to +enhance the modeling of intricate facial expressions, we condition +DiffusionAvatars directly on the expression codes obtained from NPHM via +cross-attention. Finally, to synthesize consistent surface details across +different viewpoints and expressions, we rig learnable spatial features to the +head's surface via TriPlane lookup in NPHM's canonical space. We train +DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person +and test the obtained avatars in both self-reenactment and animation scenarios. +Our experiments demonstrate that DiffusionAvatars generates temporally +consistent and visually appealing videos for novel poses and expressions of a +person, outperforming existing approaches.",cs.CV,['cs.CV'] +PREGO: online mistake detection in PRocedural EGOcentric videos,Alessandro Flaborea · Guido M. D'Amely di Melendugno · Leonardo Plini · Luca Scofano · Edoardo De Matteis · Antonino Furnari · Giovanni Maria Farinella · Fabio Galasso,https://github.com/aleflabo/PREGO,https://arxiv.org/abs/2404.01933,,,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +TEA: Test-time Energy Adaptation,Yige Yuan · Bingbing Xu · Liang Hou · Fei Sun · Huawei Shen · Xueqi Cheng, ,https://arxiv.org/abs/2311.14402,,2311.14402.pdf,TEA: Test-time Energy Adaptation,"Test-time adaptation (TTA) aims to improve model generalizability when test +data diverges from training distribution, offering the distinct advantage of +not requiring access to training data and processes, especially valuable in the +context of large pre-trained models. However, current TTA methods fail to +address the fundamental issue: covariate shift, i.e., the decreased +generalizability can be attributed to the model's reliance on the marginal +distribution of the training data, which may impair model calibration and +introduce confirmation bias. To address this, we propose a novel energy-based +perspective, enhancing the model's perception of target data distributions +without requiring access to training data or processes.
Building on this +perspective, we introduce $\textbf{T}$est-time $\textbf{E}$nergy +$\textbf{A}$daptation ($\textbf{TEA}$), which transforms the trained classifier +into an energy-based model and aligns the model's distribution with the test +data's, enhancing its ability to perceive test distributions and thus improving +overall generalizability. Extensive experiments across multiple tasks, +benchmarks and architectures demonstrate TEA's superior generalization +performance against state-of-the-art methods. Further in-depth analyses reveal +that TEA can equip the model with a comprehensive perception of test +distribution, ultimately paving the way toward improved generalization and +calibration.",cs.LG,['cs.LG'] +A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network,Ruichen Ma · Guanchao Qiao · Yian Liu · Liwei Meng · Ning Ning · Yang Liu · Shaogang Hu,https://github.com/Ruichen0424/AB-BNN,https://arxiv.org/abs/2403.03739,,2403.03739.pdf,A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network,"Binary neural networks utilize 1-bit quantized weights and activations to +reduce both the model's storage demands and computational burden. However, +advanced binary architectures still incorporate millions of inefficient and +nonhardware-friendly full-precision multiplication operations. A&B BNN is +proposed to directly remove part of the multiplication operations in a +traditional BNN and replace the rest with an equal number of bit operations, +introducing the mask layer and the quantized RPReLU structure based on the +normalizer-free network architecture. The mask layer can be removed during +inference by leveraging the intrinsic characteristics of BNN with +straightforward mathematical transformations to avoid the associated +multiplication operations. The quantized RPReLU structure enables more +efficient bit operations by constraining its slope to be integer powers of 2. +Experimental results achieved 92.30%, 69.35%, and 66.89% on the CIFAR-10, +CIFAR-100, and ImageNet datasets, respectively, which are competitive with the +state-of-the-art. Ablation studies have verified the efficacy of the quantized +RPReLU structure, leading to a 1.14% enhancement on the ImageNet compared to +using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers +an innovative approach for hardware-friendly network architecture.",cs.LG,"['cs.LG', 'cs.AI']" +"Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges",Tongtong Yuan · Xuange Zhang · Kun Liu · Bo Liu · Chen Chen · Jian Jin · Zhenzhen Jiao, ,https://arxiv.org/abs/2309.13925,,2309.13925.pdf,"Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges","Surveillance videos are an essential component of daily life with various +critical applications, particularly in public security. However, current +surveillance video tasks mainly focus on classifying and localizing anomalous +events. Existing methods are limited to detecting and classifying the +predefined events with unsatisfactory semantic understanding, although they +have obtained considerable performance. To address this issue, we propose a new +research direction of surveillance video-and-language understanding, and +construct the first multimodal surveillance video dataset. We manually annotate +the real-world surveillance dataset UCF-Crime with fine-grained event content +and timing. 
Our newly annotated dataset, UCA (UCF-Crime Annotation), contains +23,542 sentences, with an average length of 20 words, and its annotated videos +are as long as 110.7 hours. Furthermore, we benchmark SOTA models for four +multimodal tasks on this newly created dataset, which serve as new baselines +for surveillance video-and-language understanding. Through our experiments, we +find that mainstream models used in previously publicly available datasets +perform poorly on surveillance video, which demonstrates the new challenges in +surveillance video-and-language understanding. To validate the effectiveness of +our UCA, we conducted experiments on multimodal anomaly detection. The results +demonstrate that our multimodal surveillance learning can improve the +performance of conventional anomaly detection tasks. All the experiments +highlight the necessity of constructing this dataset to advance surveillance +AI. The link to our dataset is provided at: +https://xuange923.github.io/Surveillance-Video-Understanding.",cs.CV,"['cs.CV', 'cs.AI']" +Validating Privacy-Preserving Face Recognition under a Minimum Assumption,Hui Zhang · Xingbo Dong · YenLungLai · Ying Zhou · Xiaoyan ZHANG · Xingguo Lv · Zhe Jin · Xuejun Li, ,https://arxiv.org/abs/2403.12457,,2403.12457.pdf,Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,"The widespread adoption of face recognition has led to increasing privacy +concerns, as unauthorized access to face images can expose sensitive personal +information. This paper explores face image protection against viewing and +recovery attacks. Inspired by image compression, we propose creating a visually +uninformative face image through feature subtraction between an original face +and its model-produced regeneration. Recognizable identity features within the +image are encouraged by co-training a recognition model on its high-dimensional +feature representation. To enhance privacy, the high-dimensional representation +is crafted through random channel shuffling, resulting in randomized +recognizable images devoid of attacker-leverageable texture details. We distill +our methodologies into a novel privacy-preserving face recognition method, +MinusFace. Experiments demonstrate its high recognition accuracy and effective +privacy protection. Its code is available at https://github.com/Tencent/TFace.",cs.CV,['cs.CV'] +One-Shot Open Affordance Learning with Foundation Models,Gen Li · Deqing Sun · Laura Sevilla-Lara · Varun Jampani, ,https://arxiv.org/abs/2311.17776v1,,2311.17776v1.pdf,One-Shot Open Affordance Learning with Foundation Models,"We introduce One-shot Open Affordance Learning (OOAL), where a model is +trained with just one example per base object category, but is expected to +identify novel objects and affordances. While vision-language models excel at +recognizing novel objects and scenes, they often struggle to understand finer +levels of granularity such as affordances. To handle this issue, we conduct a +comprehensive analysis of existing foundation models, to explore their inherent +understanding of affordances and assess the potential for data-limited +affordance learning. We then propose a vision-language framework with simple +and effective designs that boost the alignment between visual features and +affordance text embeddings. 
Experiments on two affordance segmentation +benchmarks show that the proposed method outperforms state-of-the-art models +with less than 1% of the full training data, and exhibits reasonable +generalization capability on unseen objects and affordances.",cs.CV,['cs.CV'] +Automatic Controllable Colorization via Imagination,Xiaoyan Cong · Yue Wu · Qifeng Chen · Chenyang Lei, ,https://arxiv.org/abs/2404.05661,,2404.05661.pdf,Automatic Controllable Colorization via Imagination,"We propose a framework for automatic colorization that allows for iterative +editing and modifications. The core of our framework lies in an imagination +module: by understanding the content within a grayscale image, we utilize a +pre-trained image generation model to generate multiple images that contain the +same content. These images serve as references for coloring, mimicking the +process of human experts. As the synthesized images can be imperfect or +different from the original grayscale image, we propose a Reference Refinement +Module to select the optimal reference composition. Unlike most previous +end-to-end automatic colorization algorithms, our framework allows for +iterative and localized modifications of the colorization results because we +explicitly model the coloring samples. Extensive experiments demonstrate the +superiority of our framework over existing automatic colorization algorithms in +editability and flexibility. Project page: +https://xy-cong.github.io/imagine-colorization.",cs.CV,['cs.CV'] +GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection,Xiaotian Li · Baojie Fan · Jiandong Tian · Huijie Fan, ,https://arxiv.org/abs/2309.11804,,2309.11804.pdf,FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection,"Lidars and cameras are critical sensors that provide complementary +information for 3D detection in autonomous driving. While most prevalent +methods progressively downscale the 3D point clouds and camera images and then +fuse the high-level features, the downscaled features inevitably lose low-level +detailed information. In this paper, we propose Fine-Grained Lidar-Camera +Fusion (FGFusion) that make full use of multi-scale features of image and point +cloud and fuse them in a fine-grained way. First, we design a dual pathway +hierarchy structure to extract both high-level semantic and low-level detailed +features of the image. Second, an auxiliary network is introduced to guide +point cloud features to better learn the fine-grained spatial information. +Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of +image and point cloud. Extensive experiments on two popular autonomous driving +benchmarks, i.e. KITTI and Waymo, demonstrate the effectiveness of our method.",cs.CV,['cs.CV'] +Open Vocabulary Semantic Scene Sketch Understanding,Ahmed Bourouis · Judith Fan · Yulia Gryaditskaya,https://ahmedbourouis.github.io/Scene_Sketch_Segmentation/,https://arxiv.org/abs/2312.12463,,2312.12463.pdf,Open Vocabulary Semantic Scene Sketch Understanding,"We study the underexplored but fundamental vision problem of machine +understanding of abstract freehand scene sketches. We introduce a sketch +encoder that results in semantically-aware feature space, which we evaluate by +testing its performance on a semantic sketch segmentation task. To train our +model we rely only on the availability of bitmap sketches with their brief +captions and do not require any pixel-level annotations. 
To obtain +generalization to a large set of sketches and categories, we build on a vision +transformer encoder pretrained with the CLIP model. We freeze the text encoder +and perform visual-prompt tuning of the visual encoder branch while introducing +a set of critical modifications. Firstly, we augment the classical key-query +(k-q) self-attention blocks with value-value (v-v) self-attention blocks. +Central to our model is a two-level hierarchical network design that enables +efficient semantic disentanglement: The first level ensures holistic scene +sketch encoding, and the second level focuses on individual categories. We, +then, in the second level of the hierarchy, introduce a cross-attention between +textual and visual branches. Our method outperforms zero-shot CLIP pixel +accuracy of segmentation results by 37 points, reaching an accuracy of $85.5\%$ +on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us +to identify further improvements needed over our method to reconcile machine +and human understanding of scene sketches.",cs.CV,['cs.CV'] +View From Above: Orthogonal viewpoint aware Cross-view Localization,Shan Wang · Chuong Nguyen · Jiawei Liu · Yanhao Zhang · Sundaram Muthu · Fahira Afzal Maken · Kaihao Zhang · Hongdong Li, ,https://arxiv.org/abs/2308.08110,,2308.08110.pdf,View Consistent Purification for Accurate Cross-View Localization,"This paper proposes a fine-grained self-localization method for outdoor +robotics that utilizes a flexible number of onboard cameras and readily +accessible satellite images. The proposed method addresses limitations in +existing cross-view localization methods that struggle to handle noise sources +such as moving objects and seasonal variations. It is the first sparse +visual-only method that enhances perception in dynamic environments by +detecting view-consistent key points and their corresponding deep features from +ground and satellite views, while removing off-the-ground objects and +establishing homography transformation between the two views. Moreover, the +proposed method incorporates a spatial embedding approach that leverages camera +intrinsic and extrinsic information to reduce the ambiguity of purely visual +matching, leading to improved feature matching and overall pose estimation +accuracy. The method exhibits strong generalization and is robust to +environmental changes, requiring only geo-poses as ground truth. Extensive +experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that +our proposed method outperforms existing state-of-the-art methods, achieving +median spatial accuracy errors below $0.5$ meters along the lateral and +longitudinal directions, and a median orientation accuracy error below 2 +degrees.",cs.CV,['cs.CV'] +OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation,Jisoo Jeong · Hong Cai · Risheek Garrepalli · Jamie Lin · Munawar Hayat · Fatih Porikli, ,https://arxiv.org/abs/2403.18092,,2403.18092.pdf,OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation,"The scarcity of ground-truth labels poses one major challenge in developing +optical flow estimation models that are both generalizable and robust. While +current methods rely on data augmentation, they have yet to fully exploit the +rich information available in labeled video sequences. We propose OCAI, a +method that supports robust frame interpolation by generating intermediate +video frames alongside optical flows in between. 
Utilizing a forward warping +approach, OCAI employs occlusion awareness to resolve ambiguities in pixel +values and fills in missing values by leveraging the forward-backward +consistency of optical flows. Additionally, we introduce a teacher-student +style semi-supervised learning method on top of the interpolated frames. Using +a pair of unlabeled frames and the teacher model's predicted optical flow, we +generate interpolated frames and flows to train a student model. The teacher's +weights are maintained using Exponential Moving Averaging of the student. Our +evaluations demonstrate perceptually superior interpolation quality and +enhanced optical flow accuracy on established benchmarks such as Sintel and +KITTI.",cs.CV,['cs.CV'] +GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs,Gege Gao · Weiyang Liu · Anpei Chen · Andreas Geiger · Bernhard Schölkopf, ,https://arxiv.org/abs/2312.00093,,2312.00093.pdf,GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs,"As pretrained text-to-image diffusion models become increasingly powerful, +recent efforts have been made to distill knowledge from these text-to-image +pretrained models for optimizing a text-guided 3D model. Most of the existing +methods generate a holistic 3D model from a plain text input. This can be +problematic when the text describes a complex scene with multiple objects, +because the vectorized text embeddings are inherently unable to capture a +complex description with multiple entities and relationships. Holistic 3D +modeling of the entire scene further prevents accurate grounding of text +entities and concepts. To address this limitation, we propose GraphDreamer, a +novel framework to generate compositional 3D scenes from scene graphs, where +objects are represented as nodes and their interactions as edges. By exploiting +node and edge information in scene graphs, our method makes better use of the +pretrained text-to-image diffusion model and is able to fully disentangle +different objects without image-level supervision. To facilitate modeling of +object-wise relationships, we use signed distance fields as representation and +impose a constraint to avoid inter-penetration of objects. To avoid manual +scene graph creation, we design a text prompt for ChatGPT to generate scene +graphs based on text inputs. We conduct both qualitative and quantitative +experiments to validate the effectiveness of GraphDreamer in generating +high-fidelity compositional 3D scenes with disentangled object entities.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning,Wenjin Hou · Shiming Chen · Shuhuang Chen · Ziming Hong · Yan Wang · Xuetao Feng · Salman Khan · Fahad Shahbaz Khan · Xinge You, ,https://arxiv.org/abs/2404.14808v1,,2404.14808v1.pdf,Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning,"Generative Zero-shot learning (ZSL) learns a generator to synthesize visual +samples for unseen classes, which is an effective way to advance ZSL. However, +existing generative methods rely on the conditions of Gaussian noise and the +predefined semantic prototype, which limit the generator only optimized on +specific seen classes rather than characterizing each visual instance, +resulting in poor generalizations (\textit{e.g.}, overfitting to seen classes). 
+To address this issue, we propose a novel Visual-Augmented Dynamic Semantic +prototype method (termed VADS) to boost the generator to learn accurate +semantic-visual mapping by fully exploiting the visual-augmented knowledge into +semantic conditions. In detail, VADS consists of two modules: (1) Visual-aware +Domain Knowledge Learning module (VDKL) learns the local bias and global prior +of the visual features (referred to as domain visual knowledge), which replace +pure Gaussian noise to provide richer prior noise information; (2) +Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype +according to the visual representations of the samples. Ultimately, we +concatenate their output as a dynamic semantic prototype, which serves as the +condition of the generator. Extensive experiments demonstrate that our VADS +achieves superior CZSL and GZSL performances on three prominent datasets and +outperforms other state-of-the-art methods with averaging increases by 6.4\%, +5.9\% and 4.2\% on SUN, CUB and AWA2, respectively.",cs.CV,['cs.CV'] +EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Priors,Zhipeng Hu · Minda Zhao · Chaoyi Zhao · Xinyue Liang · Lincheng Li · Zeng Zhao · Changjie Fan · Xiaowei Zhou · Xin Yu, ,https://arxiv.org/abs/2308.13223,,2308.13223.pdf,EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior,"While image diffusion models have made significant progress in text-driven 3D +content creation, they often fail to accurately capture the intended meaning of +text prompts, especially for view information. This limitation leads to the +Janus problem, where multi-faced 3D models are generated under the guidance of +such diffusion models. In this paper, we propose a robust high-quality 3D +content generation pipeline by exploiting orthogonal-view image guidance. +First, we introduce a novel 2D diffusion model that generates an image +consisting of four orthogonal-view sub-images based on the given text prompt. +Then, the 3D content is created using this diffusion model. Notably, the +generated orthogonal-view image provides strong geometric structure priors and +thus improves 3D consistency. As a result, it effectively resolves the Janus +problem and significantly enhances the quality of 3D content creation. +Additionally, we present a 3D synthesis fusion network that can further improve +the details of the generated 3D contents. Both quantitative and qualitative +evaluations demonstrate that our method surpasses previous text-to-3D +techniques. Project page: https://efficientdreamer.github.io.",cs.CV,['cs.CV'] +Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On,Xu Yang · Changxing Ding · Zhibin Hong · Junhao Huang · Jin Tao · Xiangmin Xu, ,https://arxiv.org/abs/2404.01089,,2404.01089.pdf,Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On,"Image-based virtual try-on is an increasingly important task for online +shopping. It aims to synthesize images of a specific person wearing a specified +garment. Diffusion model-based approaches have recently become popular, as they +are excellent at image synthesis tasks. However, these approaches usually +employ additional image encoders and rely on the cross-attention mechanism for +texture transfer from the garment to the person image, which affects the +try-on's efficiency and fidelity. 
To address these issues, we propose an +Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the +fidelity of the results and introduces no additional image encoders. +Accordingly, we make contributions from two aspects. First, we propose to +concatenate the masked person and reference garment images along the spatial +dimension and utilize the resulting image as the input for the diffusion +model's denoising UNet. This enables the original self-attention layers +contained in the diffusion model to achieve efficient and accurate texture +transfer. Second, we propose a novel diffusion-based method that predicts a +precise inpainting mask based on the person and reference garment images, +further enhancing the reliability of the try-on results. In addition, we +integrate mask prediction and image synthesis into a single compact model. The +experimental results show that our approach can be applied to various try-on +tasks, e.g., garment-to-person and person-to-person try-ons, and significantly +outperforms state-of-the-art methods on popular VITON, VITON-HD databases.",cs.CV,"['cs.CV', 'cs.AI']" +ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,Yankai Jiang · Zhongzhen Huang · Rongzhao Zhang · Xiaofan Zhang · Shaoting Zhang,https://github.com/Yankai96/ZePT,https://arxiv.org/abs/2312.04964,,2312.04964.pdf,ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,"The long-tailed distribution problem in medical image analysis reflects a +high prevalence of common conditions and a low prevalence of rare ones, which +poses a significant challenge in developing a unified model capable of +identifying rare or novel tumor categories not encountered during training. In +this paper, we propose a new zero-shot pan-tumor segmentation framework (ZePT) +based on query-disentangling and self-prompting to segment unseen tumor +categories beyond the training set. ZePT disentangles the object queries into +two subsets and trains them in two stages. Initially, it learns a set of +fundamental queries for organ segmentation through an object-aware feature +grouping strategy, which gathers organ-level visual features. Subsequently, it +refines the other set of advanced queries that focus on the auto-generated +visual prompts for unseen tumor segmentation. Moreover, we introduce +query-knowledge alignment at the feature level to enhance each query's +discriminative representation and generalizability. Extensive experiments on +various tumor segmentation tasks demonstrate the performance superiority of +ZePT, which surpasses the previous counterparts and evidence the promising +ability for zero-shot tumor segmentation in real-world settings.",cs.CV,['cs.CV'] +Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld,Yijun Yang · Tianyi Zhou · kanxue Li · Dapeng Tao · Lusong Li · Li Shen · Xiaodong He · Jing Jiang · Yuhui Shi,https://github.com/stevenyangyj/Emma-Alfworld,https://arxiv.org/abs/2311.16714v1,,2311.16714v1.pdf,Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld,"While large language models (LLMs) excel in a simulated world of texts, they +struggle to interact with the more realistic world without perceptions of other +modalities such as visual or audio signals. 
Although vision-language models +(VLMs) integrate LLM modules (1) aligned with static image features, and (2) +may possess prior knowledge of world dynamics (as demonstrated in the text +world), they have not been trained in an embodied visual world and thus cannot +align with its dynamics. On the other hand, training an embodied agent in a +noisy visual world without expert guidance is often challenging and +inefficient. In this paper, we train a VLM agent living in a visual world using +an LLM agent excelling in a parallel text world (but inapplicable to the visual +world). Specifically, we distill LLM's reflection outcomes (improved actions by +analyzing mistakes) in a text world's tasks to finetune the VLM on the same +tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) +quickly adapting to the visual world dynamics. Such cross-modality imitation +learning between the two parallel worlds enables EMMA to generalize to a broad +scope of new tasks without any further guidance from the LLM expert. Extensive +evaluations on the ALFWorld benchmark highlight EMMA's superior performance to +SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the +success rate.",cs.CV,['cs.CV'] +Mip-Splatting: Alias-free 3D Gaussian Splatting,Zehao Yu · Anpei Chen · Binbin Huang · Torsten Sattler · Andreas Geiger, ,https://arxiv.org/abs/2311.16493,,2311.16493.pdf,Mip-Splatting: Alias-free 3D Gaussian Splatting,"Recently, 3D Gaussian Splatting has demonstrated impressive novel view +synthesis results, reaching high fidelity and efficiency. However, strong +artifacts can be observed when changing the sampling rate, \eg, by changing +focal length or camera distance. We find that the source for this phenomenon +can be attributed to the lack of 3D frequency constraints and the usage of a 2D +dilation filter. To address this problem, we introduce a 3D smoothing filter +which constrains the size of the 3D Gaussian primitives based on the maximal +sampling frequency induced by the input views, eliminating high-frequency +artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip +filter, which simulates a 2D box filter, effectively mitigates aliasing and +dilation issues. Our evaluation, including scenarios such a training on +single-scale images and testing on multiple scales, validates the effectiveness +of our approach.",cs.CV,['cs.CV'] +Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling,Baoquan Zhang · Huaibin Wang · Luo Chuyao · Xutao Li · Guotao liang · Yunming Ye · joeq · Yao He,https://youtu.be/N6M0jcMP9lo,https://arxiv.org/abs/2403.10071,,2403.10071.pdf,Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling,"Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in +image synthesis, which aims to represent an image with a discrete token +sequence. Existing studies effectively address this problem by learning a +discrete codebook from scratch and in a code-independent manner to quantize +continuous representations into discrete tokens. However, learning a codebook +from scratch and in a code-independent manner is highly challenging, which may +be a key reason causing codebook collapse, i.e., some code vectors can rarely +be optimized without regard to the relationship between codes and good codebook +priors such that die off finally. 
In this paper, inspired by pretrained +language models, we find that these language models have actually pretrained a +superior codebook via a large number of text corpus, but such information is +rarely exploited in VQIM. To this end, we propose a novel codebook transfer +framework with part-of-speech, called VQCT, which aims to transfer a +well-trained codebook from pretrained language models to VQIM for robust +codebook learning. Specifically, we first introduce a pretrained codebook from +language models and part-of-speech knowledge as priors. Then, we construct a +vision-related codebook with these priors for achieving codebook transfer. +Finally, a novel codebook transfer network is designed to exploit abundant +semantic relationships between codes contained in pretrained codebooks for +robust VQIM codebook learning. Experimental results on four datasets show that +our VQCT method achieves superior VQIM performance over previous +state-of-the-art methods.",cs.CV,['cs.CV'] +Multi-Level Neural Scene Graphs for Dynamic Urban Environments,Tobias Fischer · Lorenzo Porzi · Samuel Rota Bulò · Marc Pollefeys · Peter Kontschieder, ,https://arxiv.org/abs/2404.00168,,2404.00168.pdf,Multi-Level Neural Scene Graphs for Dynamic Urban Environments,"We estimate the radiance field of large-scale dynamic areas from multiple +vehicle captures under varying environmental conditions. Previous works in this +domain are either restricted to static environments, do not scale to more than +a single short video, or struggle to separately represent dynamic object +instances. To this end, we present a novel, decomposable radiance field +approach for dynamic urban environments. We propose a multi-level neural scene +graph representation that scales to thousands of images from dozens of +sequences with hundreds of fast-moving objects. To enable efficient training +and rendering of our representation, we develop a fast composite ray sampling +and rendering scheme. To test our approach in urban driving scenarios, we +introduce a new, novel view synthesis benchmark. We show that our approach +outperforms prior art by a significant margin on both established and our +proposed benchmark while being faster in training and rendering.",cs.CV,['cs.CV'] +Rethinking Multi-view Representation Learning via Distilled Disentangling,Guanzhou Ke · Bo Wang · Xiao-Li Wang · Shengfeng He, ,https://arxiv.org/abs/2403.10897,,2403.10897.pdf,Rethinking Multi-view Representation Learning via Distilled Disentangling,"Multi-view representation learning aims to derive robust representations that +are both view-consistent and view-specific from diverse data sources. This +paper presents an in-depth analysis of existing approaches in this domain, +highlighting a commonly overlooked aspect: the redundancy between +view-consistent and view-specific representations. To this end, we propose an +innovative framework for multi-view representation learning, which incorporates +a technique we term 'distilled disentangling'. Our method introduces the +concept of masked cross-view prediction, enabling the extraction of compact, +high-quality view-consistent representations from various sources without +incurring extra computational overhead. Additionally, we develop a distilled +disentangling module that efficiently filters out consistency-related +information from multi-view representations, resulting in purer view-specific +representations. 
This approach significantly reduces redundancy between +view-consistent and view-specific representations, enhancing the overall +efficiency of the learning process. Our empirical evaluations reveal that +higher mask ratios substantially improve the quality of view-consistent +representations. Moreover, we find that reducing the dimensionality of +view-consistent representations relative to that of view-specific +representations further refines the quality of the combined representations. +Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.",cs.CV,"['cs.CV', 'cs.MM']" +Neural Refinement for Absolute Pose Regression with Feature Synthesis,Shuai Chen · Yash Bhalgat · Xinghui Li · Jia-Wang Bian · Kejie Li · Zirui Wang · Victor Adrian Prisacariu, ,https://arxiv.org/html/2402.14371v2,,2402.14371v2.pdf,HR-APR: APR-agnostic Framework with Uncertainty Estimation and Hierarchical Refinement for Camera Relocalisation,"Absolute Pose Regressors (APRs) directly estimate camera poses from monocular +images, but their accuracy is unstable for different queries. Uncertainty-aware +APRs provide uncertainty information on the estimated pose, alleviating the +impact of these unreliable predictions. However, existing uncertainty modelling +techniques are often coupled with a specific APR architecture, resulting in +suboptimal performance compared to state-of-the-art (SOTA) APR methods. This +work introduces a novel APR-agnostic framework, HR-APR, that formulates +uncertainty estimation as cosine similarity estimation between the query and +database features. It does not rely on or affect APR network architecture, +which is flexible and computationally efficient. In addition, we take advantage +of the uncertainty for pose refinement to enhance the performance of APR. The +extensive experiments demonstrate the effectiveness of our framework, reducing +27.4\% and 15.2\% of computational overhead on the 7Scenes and Cambridge +Landmarks datasets while maintaining the SOTA accuracy in single-image APRs.",cs.CV,"['cs.CV', 'cs.RO']" +Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer,Zhen Zhao · Jingqun Tang · Chunhui Lin · Binghong Wu · Can Huang · Hao Liu · Xin Tan · Zhizhong Zhang · Yuan Xie,https://github.com/bytedance/E2STR,https://arxiv.org/abs/2311.13120,,2311.13120.pdf,Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer,"Scene text recognition (STR) in the wild frequently encounters challenges +when coping with domain variations, font diversity, shape deformations, etc. A +straightforward solution is performing model fine-tuning tailored to a specific +scenario, but it is computationally intensive and requires multiple model +copies for various scenarios. Recent studies indicate that large language +models (LLMs) can learn from a few demonstration examples in a training-free +manner, termed ""In-Context Learning"" (ICL). Nevertheless, applying LLMs as a +text recognizer is unacceptably resource-consuming. Moreover, our pilot +experiments on LLMs show that ICL fails in STR, mainly attributed to the +insufficient incorporation of contextual information from diverse samples in +the training stage. To this end, we introduce E$^2$STR, a STR model trained +with context-rich scene text sequences, where the sequences are generated via +our proposed in-context training strategy. E$^2$STR demonstrates that a +regular-sized model is sufficient to achieve effective ICL capabilities in STR. 
+Extensive experiments show that E$^2$STR exhibits remarkable training-free +adaptation in various scenarios and outperforms even the fine-tuned +state-of-the-art approaches on public benchmarks. The code is released at +https://github.com/bytedance/E2STR .",cs.CV,['cs.CV'] +Cross Initialization for Face Personalization of Text-to-Image Models,Lianyu Pang · Jian Yin · Haoran Xie · Qiping Wang · Qing Li · Xudong Mao, ,https://arxiv.org/abs/2312.15905,,2312.15905.pdf,Cross Initialization for Personalized Text-to-Image Generation,"Recently, there has been a surge in face personalization techniques, +benefiting from the advanced capabilities of pretrained text-to-image diffusion +models. Among these, a notable method is Textual Inversion, which generates +personalized images by inverting given images into textual embeddings. However, +methods based on Textual Inversion still struggle with balancing the trade-off +between reconstruction quality and editability. In this study, we examine this +issue through the lens of initialization. Upon closely examining traditional +initialization methods, we identified a significant disparity between the +initial and learned embeddings in terms of both scale and orientation. The +scale of the learned embedding can be up to 100 times greater than that of the +initial embedding. Such a significant change in the embedding could increase +the risk of overfitting, thereby compromising the editability. Driven by this +observation, we introduce a novel initialization method, termed Cross +Initialization, that significantly narrows the gap between the initial and +learned embeddings. This method not only improves both reconstruction and +editability but also reduces the optimization steps from 5000 to 320. +Furthermore, we apply a regularization term to keep the learned embedding close +to the initial embedding. We show that when combined with Cross Initialization, +this regularization term can effectively improve editability. We provide +comprehensive empirical evidence to demonstrate the superior performance of our +method compared to the baseline methods. Notably, in our experiments, Cross +Initialization is the only method that successfully edits an individual's +facial expression. Additionally, a fast version of our method allows for +capturing an input image in roughly 26 seconds, while surpassing the baseline +methods in terms of both reconstruction and editability. Code will be made +publicly available.",cs.CV,['cs.CV'] +Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis,Marianna Ohanyan · Hayk Manukyan · Zhangyang Wang · Shant Navasardyan · Humphrey Shi, ,https://arxiv.org/abs/2311.12342,,2311.12342.pdf,LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis,"Recent text-to-image diffusion models have reached an unprecedented level in +generating high-quality images. However, their exclusive reliance on textual +prompts often falls short in precise control of image compositions. In this +paper, we propose LoCo, a training-free approach for layout-to-image Synthesis +that excels in producing high-quality images aligned with both textual prompts +and layout instructions. Specifically, we introduce a Localized Attention +Constraint (LAC), leveraging semantic affinity between pixels in self-attention +maps to create precise representations of desired objects and effectively +ensure the accurate placement of objects in designated regions. 
We further +propose a Padding Token Constraint (PTC) to leverage the semantic information +embedded in previously neglected padding tokens, improving the consistency +between object appearance and layout instructions. LoCo seamlessly integrates +into existing text-to-image and layout-to-image models, enhancing their +performance in spatial control and addressing semantic failures observed in +prior methods. Extensive experiments showcase the superiority of our approach, +surpassing existing state-of-the-art training-free layout-to-image methods both +qualitatively and quantitatively across multiple benchmarks.",cs.CV,['cs.CV'] +NB-GTR: Narrow-Band Guided Turbulence Removal,Yifei Xia · Chu Zhou · Chengxuan Zhu · Minggui Teng · Chao Xu · Boxin Shi, ,,https://freebutuselesssoul.github.io/publications/cvpr2024b,,,,,nan +SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models,Yuzhou Huang · Liangbin Xie · Xintao Wang · Ziyang Yuan · Xiaodong Cun · Yixiao Ge · Jiantao Zhou · Chao Dong · Rui Huang · Ruimao Zhang · Ying Shan, ,https://arxiv.org/abs/2312.06739,,2312.06739.pdf,SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models,"Current instruction-based editing methods, such as InstructPix2Pix, often +fail to produce satisfactory results in complex scenarios due to their +dependence on the simple CLIP text encoder in diffusion models. To rectify +this, this paper introduces SmartEdit, a novel approach to instruction-based +image editing that leverages Multimodal Large Language Models (MLLMs) to +enhance their understanding and reasoning capabilities. However, direct +integration of these elements still faces challenges in situations requiring +complex reasoning. To mitigate this, we propose a Bidirectional Interaction +Module that enables comprehensive bidirectional information interactions +between the input image and the MLLM output. During training, we initially +incorporate perception data to boost the perception and understanding +capabilities of diffusion models. Subsequently, we demonstrate that a small +amount of complex instruction editing data can effectively stimulate +SmartEdit's editing capabilities for more complex instructions. We further +construct a new evaluation dataset, Reason-Edit, specifically tailored for +complex instruction-based image editing. Both quantitative and qualitative +results on this evaluation dataset indicate that our SmartEdit surpasses +previous methods, paving the way for the practical application of complex +instruction-based image editing.",cs.CV,['cs.CV'] +FADES: Fair Disentanglement with Sensitive Relevance,Taeuk Jang · Xiaoqian Wang, ,https://arxiv.org/abs/2405.07011,,2405.07011.pdf,Fair Graph Representation Learning via Sensitive Attribute Disentanglement,"Group fairness for Graph Neural Networks (GNNs), which emphasizes algorithmic +decisions neither favoring nor harming certain groups defined by sensitive +attributes (e.g., race and gender), has gained considerable attention. In +particular, the objective of group fairness is to ensure that the decisions +made by GNNs are independent of the sensitive attribute. To achieve this +objective, most existing approaches involve eliminating sensitive attribute +information in node representations or algorithmic decisions. However, such +ways may also eliminate task-related information due to its inherent +correlation with the sensitive attribute, leading to a sacrifice in utility. 
In +this work, we focus on improving the fairness of GNNs while preserving +task-related information and propose a fair GNN framework named FairSAD. +Instead of eliminating sensitive attribute information, FairSAD enhances the +fairness of GNNs via Sensitive Attribute Disentanglement (SAD), which separates +the sensitive attribute-related information into an independent component to +mitigate its impact. Additionally, FairSAD utilizes a channel masking mechanism +to adaptively identify the sensitive attribute-related component and +subsequently decorrelates it. Overall, FairSAD minimizes the impact of the +sensitive attribute on GNN outcomes rather than eliminating sensitive +attributes, thereby preserving task-related information associated with the +sensitive attribute. Furthermore, experiments conducted on several real-world +datasets demonstrate that FairSAD outperforms other state-of-the-art methods by +a significant margin in terms of both fairness and utility performance. Our +source code is available at https://github.com/ZzoomD/FairSAD.",cs.LG,"['cs.LG', 'cs.CY']" +VRP-SAM: SAM with Visual Reference Prompt,Yanpeng Sun · Jiahui Chen · Shan Zhang · Xinyu Zhang · Qiang Chen · gang zhang · Errui Ding · Jingdong Wang · Zechao Li, ,https://arxiv.org/abs/2402.17726,,2402.17726.pdf,VRP-SAM: SAM with Visual Reference Prompt,"In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that +empowers the Segment Anything Model (SAM) to utilize annotated reference images +as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM +can utilize annotated reference images to comprehend specific objects and +perform segmentation of specific objects in target image. It is note that the +VRP encoder can support a variety of annotation formats for reference images, +including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. +VRP-SAM achieves a breakthrough within the SAM framework by extending its +versatility and applicability while preserving SAM's inherent strengths, thus +enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, +the VRP encoder adopts a meta-learning strategy. To validate the effectiveness +of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO +datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual +reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM +demonstrates strong generalization capabilities, allowing it to perform +segmentation of unseen objects and enabling cross-domain segmentation. The +source code and models will be available at +\url{https://github.com/syp2ysy/VRP-SAM}",cs.CV,['cs.CV'] +Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It,Adam Lilja · Junsheng Fu · Erik Stenborg · Lars Hammarstrand,https://github.com/LiljaAdam/geographical-splits,https://arxiv.org/abs/2312.06420,,2312.06420.pdf,Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It,"The task of online mapping is to predict a local map using current sensor +observations, e.g. from lidar and camera, without relying on a pre-built map. +State-of-the-art methods are based on supervised learning and are trained +predominantly using two datasets: nuScenes and Argoverse 2. However, these +datasets revisit the same geographic locations across training, validation, and +test sets. 
Specifically, over $80$% of nuScenes and $40$% of Argoverse 2 +validation and test samples are less than $5$ m from a training sample. At test +time, the methods are thus evaluated more on how well they localize within a +memorized implicit map built from the training data than on extrapolating to +unseen locations. Naturally, this data leakage causes inflated performance +numbers and we propose geographically disjoint data splits to reveal the true +performance in unseen environments. Experimental results show that methods +perform considerably worse, some dropping more than $45$ mAP, when trained and +evaluated on proper data splits. Additionally, a reassessment of prior design +choices reveals diverging conclusions from those based on the original split. +Notably, the impact of lifting methods and the support from auxiliary tasks +(e.g., depth supervision) on performance appears less substantial or follows a +different trajectory than previously perceived. Splits can be found at +https://github.com/LiljaAdam/geographical-splits",cs.CV,['cs.CV'] +Gated Fields: Learning Scene Reconstruction from Gated Videos,Andrea Ramazzina · Stefanie Walz · Pragyan Dahal · Mario Bijelic · Felix Heide, ,https://arxiv.org/abs/2405.19819,,2405.19819.pdf,Gated Fields: Learning Scene Reconstruction from Gated Videos,"Reconstructing outdoor 3D scenes from temporal observations is a challenge +that recent work on neural fields has offered a new avenue for. However, +existing methods that recover scene properties, such as geometry, appearance, +or radiance, solely from RGB captures often fail when handling poorly-lit or +texture-deficient regions. Similarly, recovering scenes with scanning LiDAR +sensors is also difficult due to their low angular sampling rate which makes +recovering expansive real-world scenes difficult. Tackling these gaps, we +introduce Gated Fields - a neural scene reconstruction method that utilizes +active gated video sequences. To this end, we propose a neural rendering +approach that seamlessly incorporates time-gated capture and illumination. Our +method exploits the intrinsic depth cues in the gated videos, achieving precise +and dense geometry reconstruction irrespective of ambient illumination +conditions. We validate the method across day and night scenarios and find that +Gated Fields compares favorably to RGB and LiDAR reconstruction methods. Our +code and datasets are available at https://light.princeton.edu/gatedfields/.",cs.CV,['cs.CV'] +VINECS: Video-based Neural Character Skinning,Zhouyingcheng Liao · Vladislav Golyanik · Marc Habermann · Christian Theobalt, ,https://arxiv.org/abs/2307.00842,,2307.00842.pdf,VINECS: Video-based Neural Character Skinning,"Rigging and skinning clothed human avatars is a challenging task and +traditionally requires a lot of manual work and expertise. Recent methods +addressing it either generalize across different characters or focus on +capturing the dynamics of a single character observed under different pose +configurations. However, the former methods typically predict solely static +skinning weights, which perform poorly for highly articulated poses, and the +latter ones either require dense 3D character scans in different poses or +cannot generate an explicit mesh with vertex correspondence over time. To +address these challenges, we propose a fully automated approach for creating a +fully rigged character with pose-dependent skinning weights, which can be +solely learned from multi-view video. 
Therefore, we first acquire a rigged +template, which is then statically skinned. Next, a coordinate-based MLP learns +a skinning weights field parameterized over the position in a canonical pose +space and the respective pose. Moreover, we introduce our pose- and +view-dependent appearance field allowing us to differentiably render and +supervise the posed mesh using multi-view imagery. We show that our approach +outperforms state-of-the-art while not relying on dense 4D scans.",cs.CV,['cs.CV'] +LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,Chuwei Luo · Yufan Shen · Zhaoqing Zhu · Qi Zheng · Zhi Yu · Cong Yao,https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM,https://arxiv.org/abs/2404.05225,,2404.05225.pdf,LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,"Recently, leveraging large language models (LLMs) or multimodal large +language models (MLLMs) for document understanding has been proven very +promising. However, previous works that employ LLMs/MLLMs for document +understanding have not fully explored and utilized the document layout +information, which is vital for precise document understanding. In this paper, +we propose LayoutLLM, an LLM/MLLM based method for document understanding. The +core of LayoutLLM is a layout instruction tuning strategy, which is specially +designed to enhance the comprehension and utilization of document layouts. The +proposed layout instruction tuning strategy consists of two components: +Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture +the characteristics of document layout in Layout-aware Pre-training, three +groups of pre-training tasks, corresponding to document-level, region-level and +segment-level information, are introduced. Furthermore, a novel module called +layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on +regions relevant to the question and generate accurate answers. LayoutCoT is +effective for boosting the performance of document understanding. Meanwhile, it +brings a certain degree of interpretability, which could facilitate manual +inspection and correction. Experiments on standard benchmarks show that the +proposed LayoutLLM significantly outperforms existing methods that adopt +open-source 7B LLMs/MLLMs for document understanding. The training data of the +LayoutLLM is publicly available at +https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM",cs.CV,"['cs.CV', 'cs.CL']" +DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance,Zixuan Wang · Jia Jia · Shikun Sun · Haozhe Wu · Rong Han · Zhenyu Li · Di Tang · Jiaqing Zhou · Jiebo Luo, ,https://arxiv.org/abs/2403.13667,,2403.13667.pdf,DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance,"Choreographers determine what the dances look like, while cameramen determine +the final presentation of dances. Recently, various methods and datasets have +showcased the feasibility of dance synthesis. However, camera movement +synthesis with music and dance remains an unsolved challenging problem due to +the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D +dataset, which for the first time combines camera movement with dance motion +and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of +paired dance-camera-music data from the anime community, covering 4 music +genres. 
With this dataset, we uncover that dance camera movement is +multifaceted and human-centric, and possesses multiple influencing factors, +making dance camera synthesis a more challenging task compared to camera or +dance synthesis alone. To overcome these difficulties, we propose +DanceCamera3D, a transformer-based diffusion model that incorporates a novel +body attention loss and a condition separation strategy. For evaluation, we +devise new metrics measuring camera movement quality, diversity, and dancer +fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM +dataset, providing both quantitative and qualitative evidence showcasing the +effectiveness of our DanceCamera3D model. Code and video demos are available at +https://github.com/Carmenw1203/DanceCamera3D-Official.",cs.CV,"['cs.CV', 'cs.MM']" +Active Prompt Learning in Vision Language Models,Jihwan Bang · Sumyeong Ahn · Jae-Gil Lee, ,https://arxiv.org/abs/2311.11178,,2311.11178.pdf,Active Prompt Learning in Vision Language Models,"Pre-trained Vision Language Models (VLMs) have demonstrated notable progress +in various zero-shot tasks, such as classification and retrieval. Despite their +performance, because improving performance on new tasks requires task-specific +knowledge, their adaptation is essential. While labels are needed for the +adaptation, acquiring them is typically expensive. To overcome this challenge, +active learning, a method of achieving a high performance by obtaining labels +for a small number of samples from experts, has been studied. Active learning +primarily focuses on selecting unlabeled samples for labeling and leveraging +them to train models. In this study, we pose the question, ""how can the +pre-trained VLMs be adapted under the active learning framework?"" In response +to this inquiry, we observe that (1) simply applying a conventional active +learning framework to pre-trained VLMs even may degrade performance compared to +random selection because of the class imbalance in labeling candidates, and (2) +the knowledge of VLMs can provide hints for achieving the balance before +labeling. Based on these observations, we devise a novel active learning +framework for VLMs, denoted as PCB. To assess the effectiveness of our +approach, we conduct experiments on seven different real-world datasets, and +the results demonstrate that PCB surpasses conventional active learning and +random sampling methods. Code will be available in +https://github.com/kaist-dmlab/pcb .",cs.CV,['cs.CV'] +One-Prompt to Segment All Medical Images,Wu · Min Xu, ,https://arxiv.org/html/2305.10300v3,,2305.10300v3.pdf,One-Prompt to Segment All Medical Images,"Large foundation models, known for their strong zero-shot generalization, +have excelled in visual and language applications. However, applying them to +medical image segmentation, a domain with diverse imaging types and target +labels, remains an open challenge. Current approaches, such as adapting +interactive segmentation models like Segment Anything Model (SAM), require user +prompts for each sample during inference. Alternatively, transfer learning +methods like few/one-shot models demand labeled samples, leading to high costs. +This paper introduces a new paradigm toward the universal medical image +segmentation, termed 'One-Prompt Segmentation.' One-Prompt Segmentation +combines the strengths of one-shot and interactive methods. 
In the inference +stage, with just \textbf{one prompted sample}, it can adeptly handle the unseen +task in a single forward pass. We train One-Prompt Model on 64 open-source +medical datasets, accompanied by the collection of over 3,000 clinician-labeled +prompts. Tested on 14 previously unseen tasks, the One-Prompt Model showcases +superior zero-shot segmentation capabilities, outperforming a wide range of +related methods. The code and annotated data will be publicly released.",eess.IV,"['eess.IV', 'cs.CV']" +Reconstructing Hands in 3D with Transformers,Georgios Pavlakos · Dandan Shan · Ilija Radosavovic · Angjoo Kanazawa · David Fouhey · Jitendra Malik, ,https://arxiv.org/abs/2312.05251,,2312.05251.pdf,Reconstructing Hands in 3D with Transformers,"We present an approach that can reconstruct hands in 3D from monocular input. +Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based +architecture and can analyze hands with significantly increased accuracy and +robustness compared to previous work. The key to HaMeR's success lies in +scaling up both the data used for training and the capacity of the deep network +for hand reconstruction. For training data, we combine multiple datasets that +contain 2D or 3D hand annotations. For the deep model, we use a large scale +Vision Transformer architecture. Our final model consistently outperforms the +previous baselines on popular 3D hand pose benchmarks. To further evaluate the +effect of our design in non-controlled settings, we annotate existing +in-the-wild datasets with 2D hand keypoint annotations. On this newly collected +dataset of annotations, HInt, we demonstrate significant improvements over +existing baselines. We make our code, data and models available on the project +website: https://geopavlakos.github.io/hamer/.",cs.CV,['cs.CV'] +Can Biases in ImageNet Models Explain Generalization?,Paul Gavrikov · Janis Keuper,https://github.com/paulgavrikov/biases_vs_generalization,https://arxiv.org/abs/2404.01509,,2404.01509.pdf,Can Biases in ImageNet Models Explain Generalization?,"The robust generalization of models to rare, in-distribution (ID) samples +drawn from the long tail of the training distribution and to +out-of-training-distribution (OOD) samples is one of the major challenges of +current deep learning methods. For image classification, this manifests in the +existence of adversarial attacks, the performance drops on distorted images, +and a lack of generalization to concepts such as sketches. The current +understanding of generalization in neural networks is very limited, but some +biases that differentiate models from human vision have been identified and +might be causing these limitations. Consequently, several attempts with varying +success have been made to reduce these biases during training to improve +generalization. We take a step back and sanity-check these attempts. Fixing the +architecture to the well-established ResNet-50, we perform a large-scale study +on 48 ImageNet models obtained via different training methods to understand how +and if these biases - including shape bias, spectral biases, and critical bands +- interact with generalization. Our extensive study results reveal that +contrary to previous findings, these biases are insufficient to accurately +predict the generalization of a model holistically. 
We provide access to all +checkpoints and evaluation code at +https://github.com/paulgavrikov/biases_vs_generalization",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'stat.ML']" +"Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA",Zhuowan Li · Bhavan Jasani · Peng Tang · Shabnam Ghadar, ,https://arxiv.org/abs/2403.16385,,2403.16385.pdf,"Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA","Understanding data visualizations like charts and plots requires reasoning +about both visual elements and numerics. Although strong in extractive +questions, current chart visual question answering (chart VQA) models suffer on +complex reasoning questions. In this work, we address the lack of reasoning +ability by data augmentation. We leverage Large Language Models (LLMs), which +have shown to have strong reasoning ability, as an automatic data annotator +that generates question-answer annotations for chart images. The key innovation +in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data +generator learns to decompose the complex question into step-by-step +sub-questions (rationales), which are then used to derive the final answer +using external tools, i.e. Python. This step-wise generation procedure is +trained on synthetic data generated using a template-based QA generation +pipeline. Experimental results highlight the significance of the proposed +step-by-step generation. By training with the LLM-augmented data (LAMENDA), we +significantly enhance the chart VQA models, achieving the state-of-the-art +accuracy on the ChartQA and PlotQA datasets. In particular, our approach +improves the accuracy of the previous state-of-the-art approach from 38% to 54% +on the human-written questions in the ChartQA dataset, which needs strong +reasoning. We hope our work underscores the potential of synthetic data and +encourages further exploration of data augmentation using LLMs for +reasoning-heavy tasks.",cs.CV,"['cs.CV', 'cs.CL']" +TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video,Minye Wu · Zehao Wang · Georgios Kouros · Tinne Tuytelaars, ,https://arxiv.org/abs/2312.06713,,2312.06713.pdf,TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video,"Neural Radiance Fields (NeRF) revolutionize the realm of visual media by +providing photorealistic Free-Viewpoint Video (FVV) experiences, offering +viewers unparalleled immersion and interactivity. However, the technology's +significant storage requirements and the computational complexity involved in +generation and rendering currently limit its broader application. To close this +gap, this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF), a novel +technology that significantly reduces the storage size for Free-Viewpoint Video +(FVV) while maintaining low-cost generation and rendering. TeTriRF introduces a +hybrid representation with tri-planes and voxel grids to support scaling up to +long-duration sequences and scenes with complex motions or rapid changes. We +propose a group training scheme tailored to achieving high training efficiency +and yielding temporally consistent, low-entropy scene representations. +Leveraging these properties of the representations, we introduce a compression +pipeline with off-the-shelf video codecs, achieving an order of magnitude less +storage size compared to the state-of-the-art. 
Our experiments demonstrate that +TeTriRF can achieve competitive quality with a higher compression rate.",cs.CV,['cs.CV'] +Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation,Alexander Raistrick · Lingjie Mei · Karhan Kayan · David Yan · Yiming Zuo · Beining Han · Hongyu Wen · Meenal Parakh · Stamatis Alexandropoulos · Lahav Lipson · Zeyu Ma · Jia Deng, ,https://arxiv.org/abs/2306.09310,,2306.09310.pdf,Infinite Photorealistic Worlds using Procedural Generation,"We introduce Infinigen, a procedural generator of photorealistic 3D scenes of +the natural world. Infinigen is entirely procedural: every asset, from shape to +texture, is generated from scratch via randomized mathematical rules, using no +external source and allowing infinite variation and composition. Infinigen +offers broad coverage of objects and scenes in the natural world including +plants, animals, terrains, and natural phenomena such as fire, cloud, rain, and +snow. Infinigen can be used to generate unlimited, diverse training data for a +wide range of computer vision tasks including object detection, semantic +segmentation, optical flow, and 3D reconstruction. We expect Infinigen to be a +useful resource for computer vision research and beyond. Please visit +https://infinigen.org for videos, code and pre-generated data.",cs.CV,['cs.CV'] +TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis,Pavlo Melnyk · Andreas Robinson · Michael Felsberg · Mårten Wadenbäck,https://github.com/pavlo-melnyk/tetrasphere,,https://www.youtube.com/watch?v=MRJr0V7eMj8,,,,,nan +Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft,Hao Li · Xue Yang · Zhaokai Wang · Xizhou Zhu · Jie Zhou · Yu Qiao · Xiaogang Wang · Hongsheng Li · Lewei Lu · Jifeng Dai,https://yangxue0827.github.io/auto_mc-reward.html,https://arxiv.org/abs/2312.09238,,2312.09238.pdf,Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft,"Many reinforcement learning environments (e.g., Minecraft) provide only +sparse rewards that indicate task completion or failure with binary values. The +challenge in exploration efficiency in such environments makes it difficult for +reinforcement-learning-based agents to learn complex tasks. To address this, +this paper introduces an advanced learning system, named Auto MC-Reward, that +leverages Large Language Models (LLMs) to automatically design dense reward +functions, thereby enhancing the learning efficiency. Auto MC-Reward consists +of three important components: Reward Designer, Reward Critic, and Trajectory +Analyzer. Given the environment information and task descriptions, the Reward +Designer first design the reward function by coding an executable Python +function with predefined observation inputs. Then, our Reward Critic will be +responsible for verifying the code, checking whether the code is +self-consistent and free of syntax and semantic errors. Further, the Trajectory +Analyzer summarizes possible failure causes and provides refinement suggestions +according to collected trajectories. In the next round, Reward Designer will +further refine and iterate the dense reward function based on feedback. 
+Experiments demonstrate a significant improvement in the success rate and +learning efficiency of our agents in complex tasks in Minecraft, such as +obtaining diamond with the efficient ability to avoid lava, and efficiently +explore trees and animals that are sparse in the plains biome.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG']" +Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior,Zhenyu Chen · Jie Guo · Shuichang Lai · Ruoyu Fu · mengxun kong · Chen Wang · Hongyu Sun · Zhebin Zhang · Chen Li · Yanwen Guo, ,,https://github.com/ZhenyuChen1999/IPTNet,,,,,nan +Dual-View Visual Contextualization for Web Navigation,Jihyung Kil · Chan Hee Song · Boyuan Zheng · Xiang Deng · Yu Su · Wei-Lun Chao, ,https://arxiv.org/abs/2402.04476,,2402.04476.pdf,Dual-View Visual Contextualization for Web Navigation,"Automatic web navigation aims to build a web agent that can follow language +instructions to execute complex and diverse tasks on real-world websites. +Existing work primarily takes HTML documents as input, which define the +contents and action spaces (i.e., actionable elements and operations) of +webpages. Nevertheless, HTML documents may not provide a clear task-related +context for each element, making it hard to select the right (sequence of) +actions. In this paper, we propose to contextualize HTML elements through their +""dual views"" in webpage screenshots: each HTML element has its corresponding +bounding box and visual content in the screenshot. We build upon the insight -- +web developers tend to arrange task-related elements nearby on webpages to +enhance user experiences -- and propose to contextualize each element with its +neighbor elements, using both textual and visual features. The resulting +representations of HTML elements are more informative for the agent to take +action. We validate our method on the recently released Mind2Web dataset, which +features diverse navigation domains and tasks on real-world websites. Our +method consistently outperforms the baseline in all the scenarios, including +cross-task, cross-website, and cross-domain ones.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection,Mikhail Kennerley · Jian-Gang Wang · Bharadwaj Veeravalli · Robby T. Tan,https://www.mikhailkennerley.com/cat,https://arxiv.org/abs/2403.19278v1,,2403.19278v1.pdf,CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection,"Domain adaptive object detection aims to adapt detection models to domains +where annotated data is unavailable. Existing methods have been proposed to +address the domain gap using the semi-supervised student-teacher framework. +However, a fundamental issue arises from the class imbalance in the labelled +training set, which can result in inaccurate pseudo-labels. The relationship +between classes, especially where one class is a majority and the other +minority, has a large impact on class bias. We propose Class-Aware Teacher +(CAT) to address the class bias issue in the domain adaptation setting. In our +work, we approximate the class relationships with our Inter-Class Relation +module (ICRm) and exploit it to reduce the bias within the model. In this way, +we are able to apply augmentations to highly related classes, both inter- and +intra-domain, to boost the performance of minority classes while having minimal +impact on majority classes. We further reduce the bias by implementing a +class-relation weight to our classification loss. 
Experiments conducted on +various datasets and ablation studies show that our method is able to address +the class bias in the domain adaptation setting. On the Cityscapes to Foggy +Cityscapes dataset, we attained a 52.5 mAP, a substantial improvement over the +51.2 mAP achieved by the state-of-the-art method.",cs.CV,['cs.CV'] +HOISDF: Constraining 3D Hand Object Pose Estimation with Global Signed Distance Fields,Haozhe Qi · Chen Zhao · Mathieu Salzmann · Alexander Mathis, ,https://arxiv.org/abs/2402.17062,,2402.17062.pdf,HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields,"Human hands are highly articulated and versatile at handling objects. Jointly +estimating the 3D poses of a hand and the object it manipulates from a +monocular camera is challenging due to frequent occlusions. Thus, existing +methods often rely on intermediate 3D shape representations to increase +performance. These representations are typically explicit, such as 3D point +clouds or meshes, and thus provide information in the direct surroundings of +the intermediate hand pose estimate. To address this, we introduce HOISDF, a +Signed Distance Field (SDF) guided hand-object pose estimation network, which +jointly exploits hand and object SDFs to provide a global, implicit +representation over the complete reconstruction volume. Specifically, the role +of the SDFs is threefold: equip the visual encoder with implicit shape +information, help to encode hand-object interactions, and guide the hand and +object pose regression via SDF-based sampling and by augmenting the feature +representations. We show that HOISDF achieves state-of-the-art results on +hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available +at https://github.com/amathislab/HOISDF",cs.CV,['cs.CV'] +Learning Object State Changes in Videos: An Open-World Perspective,Zihui Xue · Kumar Ashutosh · Kristen Grauman,https://vision.cs.utexas.edu/projects/VidOSC/,https://arxiv.org/abs/2312.11782,,2312.11782.pdf,Learning Object State Changes in Videos: An Open-World Perspective,"Object State Changes (OSCs) are pivotal for video understanding. While humans +can effortlessly generalize OSC understanding from familiar to unknown objects, +current approaches are confined to a closed vocabulary. Addressing this gap, we +introduce a novel open-world formulation for the video OSC problem. The goal is +to temporally localize the three stages of an OSC -- the object's initial +state, its transitioning state, and its end state -- whether or not the object +has been observed during training. Towards this end, we develop VidOSC, a +holistic learning approach that: (1) leverages text and vision-language models +for supervisory signals to obviate manually labeling OSC training data, and (2) +abstracts fine-grained shared state representations from objects to enhance +generalization. Furthermore, we present HowToChange, the first open-world +benchmark for video OSC localization, which offers an order of magnitude +increase in the label space and annotation volume compared to the best existing +benchmark. Experimental results demonstrate the efficacy of our approach, in +both traditional closed-world and open-world scenarios.",cs.CV,['cs.CV'] +Depth Prompting for Sensor-Agnostic Depth Estimation,Jin-Hwi Park · Chanhwi Jeong · Junoh Lee · Hae-Gon Jeon, ,https://arxiv.org/abs/2405.11867,,2405.11867.pdf,Depth Prompting for Sensor-Agnostic Depth Estimation,"Dense depth maps have been used as a key element of visual perception tasks. 
+There have been tremendous efforts to enhance the depth quality, ranging from +optimization-based to learning-based methods. Despite the remarkable progress +for a long time, their applicability in the real world is limited due to +systematic measurement biases such as density, sensing pattern, and scan range. +It is well-known that the biases make it difficult for these methods to achieve +their generalization. We observe that learning a joint representation for input +modalities (e.g., images and depth), which most recent methods adopt, is +sensitive to the biases. In this work, we disentangle those modalities to +mitigate the biases with prompt engineering. For this, we design a novel depth +prompt module to allow the desirable feature representation according to new +depth distributions from either sensor types or scene configurations. Our depth +prompt can be embedded into foundation models for monocular depth estimation. +Through this embedding process, our method helps the pretrained model to be +free from restraint of depth scan range and to provide absolute scale depth +maps. We demonstrate the effectiveness of our method through extensive +evaluations. Source code is publicly available at +https://github.com/JinhwiPark/DepthPrompting .",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought,Junyi Yao · Yijiang Liu · Zhen Dong · Mingfei Guo · Helan Hu · Kurt Keutzer · Li Du · Daquan Zhou · Shanghang Zhang, ,https://arxiv.org/abs/2307.13339,,2307.13339.pdf,Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions,"Chain-of-thought (CoT) prompting has been shown to empirically improve the +accuracy of large language models (LLMs) on various question answering tasks. +While understanding why CoT prompting is effective is crucial to ensuring that +this phenomenon is a consequence of desired model behavior, little work has +addressed this; nonetheless, such an understanding is a critical prerequisite +for responsible model deployment. We address this question by leveraging +gradient-based feature attribution methods which produce saliency scores that +capture the influence of input tokens on model output. Specifically, we probe +several open-source LLMs to investigate whether CoT prompting affects the +relative importances they assign to particular input tokens. Our results +indicate that while CoT prompting does not increase the magnitude of saliency +scores attributed to semantically relevant tokens in the prompt compared to +standard few-shot prompting, it increases the robustness of saliency scores to +question perturbations and variations in model output.",cs.CL,"['cs.CL', 'cs.AI']" +Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation,Wenxiao Deng · Wenbin Li · Tianyu Ding · Lei Wang · Hongguang Zhang · Kuihua Huang · Jing Huo · Yang Gao, ,https://arxiv.org/abs/2404.00563,,2404.00563.pdf,Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation,"Dataset distillation has emerged as a promising approach in deep learning, +enabling efficient training with small synthetic datasets derived from larger +real ones. Particularly, distribution matching-based distillation methods +attract attention thanks to its effectiveness and low computational cost. 
+However, these methods face two primary limitations: the dispersed feature +distribution within the same class in synthetic datasets, reducing class +discrimination, and an exclusive focus on mean feature consistency, lacking +precision and comprehensiveness. To address these challenges, we introduce two +novel constraints: a class centralization constraint and a covariance matching +constraint. The class centralization constraint aims to enhance class +discrimination by more closely clustering samples within classes. The +covariance matching constraint seeks to achieve more accurate feature +distribution matching between real and synthetic datasets through local feature +covariance matrices, particularly beneficial when sample sizes are much smaller +than the number of features. Experiments demonstrate notable improvements with +these constraints, yielding performance boosts of up to 6.6% on CIFAR10, 2.9% +on SVHN, 2.5% on CIFAR100, and 2.5% on TinyImageNet, compared to the +state-of-the-art relevant methods. In addition, our method maintains robust +performance in cross-architecture settings, with a maximum performance drop of +1.7% on four architectures. Code is available at +https://github.com/VincenDen/IID.",cs.CV,['cs.CV'] +MeshPose: Unifying DensePose and 3D Body Mesh reconstruction,Eric-Tuan Le · Antonios Kakolyris · Petros Koutras · Himmy Tam · Efstratios Skordos · George Papandreou · Riza Alp Guler · Iasonas Kokkinos, ,https://arxiv.org/abs/2308.10305,,2308.10305.pdf,Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video,"Despite significant progress in single image-based 3D human mesh recovery, +accurately and smoothly recovering 3D human motion from a video remains +challenging. Existing video-based methods generally recover human mesh by +estimating the complex pose and shape parameters from coupled image features, +whose high complexity and low representation ability often result in +inconsistent pose motion and limited shape patterns. To alleviate this issue, +we introduce 3D pose as the intermediary and propose a Pose and Mesh +Co-Evolution network (PMCE) that decouples this task into two parts: 1) +video-based 3D human pose estimation and 2) mesh vertices regression from the +estimated 3D pose and temporal image feature. Specifically, we propose a +two-stream encoder that estimates mid-frame 3D pose and extracts a temporal +image feature from the input image sequence. In addition, we design a +co-evolution decoder that performs pose and mesh interactions with the +image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the +human body shape. Extensive experiments demonstrate that the proposed PMCE +outperforms previous state-of-the-art methods in terms of both per-frame +accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, +and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces,Haithem Turki · Vasu Agrawal · Samuel Rota Bulò · Lorenzo Porzi · Peter Kontschieder · Deva Ramanan · Michael Zollhoefer · Christian Richardt,https://haithemturki.com/hybrid-nerf/,https://arxiv.org/abs/2312.03160,,2312.03160.pdf,HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces,"Neural radiance fields provide state-of-the-art view synthesis quality but +tend to be slow to render. 
One reason is that they make use of volume +rendering, thus requiring many samples (and model queries) per ray at render +time. Although this representation is flexible and easy to optimize, most +real-world objects can be modeled more efficiently with surfaces instead of +volumes, requiring far fewer samples per ray. This observation has spurred +considerable progress in surface representations such as signed distance +functions, but these may struggle to model semi-opaque and thin structures. We +propose a method, HybridNeRF, that leverages the strengths of both +representations by rendering most objects as surfaces while modeling the +(typically) small fraction of challenging regions volumetrically. We evaluate +HybridNeRF against the challenging Eyeful Tower dataset along with other +commonly used view synthesis datasets. When comparing to state-of-the-art +baselines, including recent rasterization-based approaches, we improve error +rates by 15-30% while achieving real-time framerates (at least 36 FPS) for +virtual-reality resolutions (2Kx2K).",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling,Jiaheng Liu · Jianhao Li · Kaisiyuan Wang · Hongcheng Guo · Jian Yang · Junran Peng · Ke Xu · Xianglong Liu · Jinyang Guo, ,https://arxiv.org/abs/2404.00857,,2404.00857.pdf,Meta Episodic learning with Dynamic Task Sampling for CLIP-based Point Cloud Classification,"Point cloud classification refers to the process of assigning semantic labels +or categories to individual points within a point cloud data structure. Recent +works have explored the extension of pre-trained CLIP to 3D recognition. In +this direction, CLIP-based point cloud models like PointCLIP, CLIP2Point have +become state-of-the-art methods in the few-shot setup. Although these methods +show promising performance for some classes like airplanes, desks, guitars, +etc, the performance for some classes like the cup, flower pot, sink, +nightstand, etc is still far from satisfactory. This is due to the fact that +the adapter of CLIP-based models is trained using randomly sampled N-way K-shot +data in the standard supervised learning setup. In this paper, we propose a +novel meta-episodic learning framework for CLIP-based point cloud +classification, addressing the challenges of limited training examples and +sampling unknown classes. Additionally, we introduce dynamic task sampling +within the episode based on performance memory. This sampling strategy +effectively addresses the challenge of sampling unknown classes, ensuring that +the model learns from a diverse range of classes and promotes the exploration +of underrepresented categories. By dynamically updating the performance memory, +we adaptively prioritize the sampling of classes based on their performance, +enhancing the model's ability to handle challenging and real-world scenarios. +Experiments show an average performance gain of 3-6\% on ModelNet40 and +ScanobjectNN datasets in a few-shot setup.",cs.CV,['cs.CV'] +OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers,Han Liang · Jiacheng Bao · Ruichi Zhang · Sihan Ren · Yuecheng Xu · Sibei Yang · Xin Chen · Jingyi Yu · Lan Xu, ,https://arxiv.org/abs/2312.08985v3,,2312.08985v3.pdf,OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers,"We have recently seen tremendous progress in realistic text-to-motion +generation. Yet, the existing methods often fail or produce implausible motions +with unseen text inputs, which limits the applications. 
In this paper, we +present OMG, a novel framework, which enables compelling motion generation from +zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the +pretrain-then-finetune paradigm into the text-to-motion generation. At the +pre-training stage, our model improves the generation ability by learning the +rich out-of-domain inherent motion traits. To this end, we scale up a large +unconditional diffusion model up to 1B parameters, so as to utilize the massive +unlabeled motion data up to over 20M motion instances. At the subsequent +fine-tuning stage, we introduce motion ControlNet, which incorporates text +prompts as conditioning information, through a trainable copy of the +pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. +MoC block adaptively recognizes various ranges of the sub-motions with a +cross-attention mechanism and processes them separately with the +text-token-specific experts. Such a design effectively aligns the CLIP token +embeddings of text prompts to various ranges of compact and expressive motion +features. Extensive experiments demonstrate that our OMG achieves significant +improvements over the state-of-the-art methods on zero-shot text-to-motion +generation. Project page: https://tr3e.github.io/omg-page.",cs.CV,['cs.CV'] +FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning,Gihun Lee · Minchan Jeong · SangMook Kim · Jaehoon Oh · Se-Young Yun, ,https://arxiv.org/abs/2308.12532v6,,2308.12532v6.pdf,FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning,"Federated Learning (FL) aggregates locally trained models from individual +clients to construct a global model. While FL enables learning a model with +data privacy, it often suffers from significant performance degradation when +clients have heterogeneous data distributions. This data heterogeneity causes +the model to forget the global knowledge acquired from previously sampled +clients after being trained on local datasets. Although the introduction of +proximal objectives in local updates helps to preserve global knowledge, it can +also hinder local learning by interfering with local objectives. To address +this problem, we propose a novel method, Federated Stabilized Orthogonal +Learning (FedSOL), which adopts an orthogonal learning strategy to balance the +two conflicting objectives. FedSOL is designed to identify gradients of local +objectives that are inherently orthogonal to directions affecting the proximal +objective. Specifically, FedSOL targets parameter regions where learning on the +local objective is minimally influenced by proximal weight perturbations. Our +experiments demonstrate that FedSOL consistently achieves state-of-the-art +performance across various scenarios.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation,Ziyi Chen · Xiaolong Wu · Yu Zhang, ,https://arxiv.org/abs/2405.00340,,2405.00340.pdf,NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation,"State-of-the-art neural implicit surface representations have achieved +impressive results in indoor scene reconstruction by incorporating monocular +geometric priors as additional supervision. However, we have observed that +multi-view inconsistency between such priors poses a challenge for high-quality +reconstructions. 
In response, we present NC-SDF, a neural signed distance field +(SDF) 3D reconstruction framework with view-dependent normal compensation (NC). +Specifically, we integrate view-dependent biases in monocular normal priors +into the neural implicit representation of the scene. By adaptively learning +and correcting the biases, our NC-SDF effectively mitigates the adverse impact +of inconsistent supervision, enhancing both the global consistency and local +details in the reconstructions. To further refine the details, we introduce an +informative pixel sampling strategy to pay more attention to intricate geometry +with higher information content. Additionally, we design a hybrid geometry +modeling approach to improve the neural implicit representation. Experiments on +synthetic and real-world datasets demonstrate that NC-SDF outperforms existing +approaches in terms of reconstruction quality.",cs.CV,['cs.CV'] +GLID: Pre-training a Generalist Encoder-Decoder Vision Model,Jihao Liu · Jinliang Zheng · Yu Liu · Hongsheng Li,https://arxiv.org/abs/2404.07603,https://arxiv.org/abs/2404.07603,,2404.07603.pdf,GLID: Pre-training a Generalist Encoder-Decoder Vision Model,"This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method +for better handling various downstream computer vision tasks. While +self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown +success in transfer learning, task-specific sub-architectures are still +required to be appended for different downstream tasks, which cannot enjoy the +benefits of large-scale pre-training. GLID overcomes this challenge by allowing +the pre-trained generalist encoder-decoder to be fine-tuned on various vision +tasks with minimal task-specific architecture modifications. In the GLID +training scheme, pre-training pretext task and other downstream tasks are +modeled as ""query-to-answer"" problems, including the pre-training pretext task +and other downstream tasks. We pre-train a task-agnostic encoder-decoder with +query-mask pairs. During fine-tuning, GLID maintains the pre-trained +encoder-decoder and queries, only replacing the topmost linear transformation +layer with task-specific linear heads. This minimizes the pretrain-finetune +architecture inconsistency and enables the pre-trained model to better adapt to +downstream tasks. GLID achieves competitive performance on various vision +tasks, including object detection, image segmentation, pose estimation, and +depth estimation, outperforming or matching specialist models such as +Mask2Former, DETR, ViTPose, and BinsFormer.",cs.CV,['cs.CV'] +Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning,Ziming Hong · Li Shen · Tongliang Liu, ,,https://openreview.net/forum?id=FYKVPOHCpE,,,,,nan +Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining,Qi Cui · Ruohan Meng · Chaohui Xu · Chip Hong Chang,https://github.com/TracyCuiq/Steganographic-Passport,https://arxiv.org/abs/2404.02889,,2404.02889.pdf,Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining,"Ensuring the legal usage of deep models is crucial to promoting trustable, +accountable, and responsible artificial intelligence innovation. Current +passport-based methods that obfuscate model functionality for license-to-use +and ownership verifications suffer from capacity and quality constraints, as +they require retraining the owner model for new users. 
They are also vulnerable +to advanced Expanded Residual Block ambiguity attacks. We propose +Steganographic Passport, which uses an invertible steganographic network to +decouple license-to-use from ownership verification by hiding the user's +identity images into the owner-side passport and recovering them from their +respective user-side passports. An irreversible and collision-resistant hash +function is used to avoid exposing the owner-side passport from the derived +user-side passports and increase the uniqueness of the model signature. To +safeguard both the passport and model's weights against advanced ambiguity +attacks, an activation-level obfuscation is proposed for the verification +branch of the owner's model. By jointly training the verification and +deployment branches, their weights become tightly coupled. The proposed method +supports agile licensing of deep models by providing a strong ownership proof +and license accountability without requiring a separate model retraining for +the admission of every new user. Experiment results show that our +Steganographic Passport outperforms other passport-based deep model protection +methods in robustness against various known attacks.",cs.CR,"['cs.CR', 'cs.CV']" +NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation,Vikas Thamizharasan · Difan Liu · Matthew Fisher · Nanxuan Zhao · Evangelos Kalogerakis · Michal Lukáč, ,https://arxiv.org/abs/2405.15217,,2405.15217.pdf,NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation,"The success of denoising diffusion models in representing rich data +distributions over 2D raster images has prompted research on extending them to +other data representations, such as vector graphics. Unfortunately due to their +variable structure and scarcity of vector training data, directly applying +diffusion models on this domain remains a challenging problem. Using +workarounds like optimization via Score Distillation Sampling (SDS) is also +fraught with difficulty, as vector representations are non trivial to directly +optimize and tend to result in implausible geometries such as redundant or +self-intersecting shapes. NIVeL addresses these challenges by reinterpreting +the problem on an alternative, intermediate domain which preserves the +desirable properties of vector graphics -- mainly sparsity of representation +and resolution-independence. This alternative domain is based on neural +implicit fields expressed in a set of decomposable, editable layers. Based on +our experiments, NIVeL produces text-to-vector graphics results of +significantly better quality than the state-of-the-art.",cs.CV,"['cs.CV', 'cs.GR']" +GlitchBench: Can large multimodal models detect video game glitches?,Mohammad Reza Taesiri · Tianjun Feng · Cor-Paul Bezemer · Anh Nguyen, ,https://arxiv.org/abs/2312.05291,,2312.05291.pdf,GlitchBench: Can large multimodal models detect video game glitches?,"Large multimodal models (LMMs) have evolved from large language models (LLMs) +to integrate multiple input modalities, such as visual inputs. This integration +augments the capacity of LLMs for tasks requiring visual comprehension and +reasoning. However, the extent and limitations of their enhanced abilities are +not fully understood, especially when it comes to real-world tasks. To address +this gap, we introduce GlitchBench, a novel benchmark derived from video game +quality assurance tasks, to test and evaluate the reasoning capabilities of +LMMs. 
Our benchmark is curated from a variety of unusual and glitched scenarios +from video games and aims to challenge both the visual and linguistic reasoning +powers of LMMs in detecting and interpreting out-of-the-ordinary events. We +evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents +a new challenge for these models. Code and data are available at: +https://glitchbench.github.io/",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions,Chunlong Xia · Xinliang Wang · Feng Lv · Xin Hao · Yifeng Shi, ,https://arxiv.org/abs/2403.07392,,2403.07392.pdf,ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions,"Although Vision Transformer (ViT) has achieved significant success in +computer vision, it does not perform well in dense prediction tasks due to the +lack of inner-patch information interaction and the limited diversity of +feature scale. Most existing studies are devoted to designing vision-specific +transformers to solve the above problems, which introduce additional +pre-training costs. Therefore, we present a plain, pre-training-free, and +feature-enhanced ViT backbone with Convolutional Multi-scale feature +interaction, named ViT-CoMer, which facilitates bidirectional interaction +between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has +the following advantages: (1) We inject spatial pyramid multi-receptive field +convolutional features into the ViT architecture, which effectively alleviates +the problems of limited local information interaction and single-feature +representation in ViT. (2) We propose a simple and efficient CNN-Transformer +bidirectional fusion interaction module that performs multi-scale fusion across +hierarchical features, which is beneficial for handling dense prediction tasks. +(3) We evaluate the performance of ViT-CoMer across various dense prediction +tasks, different frameworks, and multiple advanced pre-training. Notably, our +ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and +62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art +methods. We hope ViT-CoMer can serve as a new backbone for dense prediction +tasks to facilitate future research. The code will be released at +https://github.com/Traffic-X/ViT-CoMer.",cs.CV,['cs.CV'] +LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis,Zehan Zheng · Fan Lu · Weiyi Xue · Guang Chen · Changjun Jiang,https://dyfcalid.github.io/LiDAR4D,https://arxiv.org/abs/2404.02742,,2404.02742.pdf,LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis,"Although neural radiance fields (NeRFs) have achieved triumphs in image novel +view synthesis (NVS), LiDAR NVS remains largely unexplored. Previous LiDAR NVS +methods employ a simple shift from image NVS methods while ignoring the dynamic +nature and the large-scale reconstruction problem of LiDAR point clouds. In +light of this, we propose LiDAR4D, a differentiable LiDAR-only framework for +novel space-time LiDAR view synthesis. In consideration of the sparsity and +large-scale characteristics, we design a 4D hybrid representation combined with +multi-planar and grid features to achieve effective reconstruction in a +coarse-to-fine manner. Furthermore, we introduce geometric constraints derived +from point clouds to improve temporal consistency. 
For the realistic synthesis +of LiDAR point clouds, we incorporate the global optimization of ray-drop +probability to preserve cross-region patterns. Extensive experiments on +KITTI-360 and NuScenes datasets demonstrate the superiority of our method in +accomplishing geometry-aware and time-consistent dynamic reconstruction. Codes +are available at https://github.com/ispc-lab/LiDAR4D.",cs.CV,['cs.CV'] +AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation,Taeckyung Lee · Sorn Chottananurak · Taesik Gong · Sung-Ju Lee,https://nmsl.kaist.ac.kr/projects/aetta/,https://arxiv.org/abs/2404.01351,,2404.01351.pdf,AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation,"Test-time adaptation (TTA) has emerged as a viable solution to adapt +pre-trained models to domain shifts using unlabeled test data. However, TTA +faces challenges of adaptation failures due to its reliance on blind adaptation +to unknown test samples in dynamic scenarios. Traditional methods for +out-of-distribution performance estimation are limited by unrealistic +assumptions in the TTA context, such as requiring labeled data or re-training +models. To address this issue, we propose AETTA, a label-free accuracy +estimation algorithm for TTA. We propose the prediction disagreement as the +accuracy estimate, calculated by comparing the target model prediction with +dropout inferences. We then improve the prediction disagreement to extend the +applicability of AETTA under adaptation failures. Our extensive evaluation with +four baselines and six TTA methods demonstrates that AETTA shows an average of +19.8%p more accurate estimation compared with the baselines. We further +demonstrate the effectiveness of accuracy estimation with a model recovery case +study, showcasing the practicality of our model recovery based on accuracy +estimation. The source code is available at https://github.com/taeckyung/AETTA.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Adversarial Distillation Based on Slack Matching and Attribution Region Alignment,Shenglin Yin · Zhen Xiao · Mingxuan Song · Jieyi Long, ,https://arxiv.org/abs/2312.08912,,2312.08912.pdf,Dataset Distillation via Adversarial Prediction Matching,"Dataset distillation is the technique of synthesizing smaller condensed +datasets from large original datasets while retaining necessary information to +persist the effect. In this paper, we approach the dataset distillation problem +from a novel perspective: we regard minimizing the prediction discrepancy on +the real data distribution between models, which are respectively trained on +the large original dataset and on the small distilled dataset, as a conduit for +condensing information from the raw data into the distilled version. An +adversarial framework is proposed to solve the problem efficiently. In contrast +to existing distillation methods involving nested optimization or long-range +gradient unrolling, our approach hinges on single-level optimization. This +ensures the memory efficiency of our method and provides a flexible tradeoff +between time and memory budgets, allowing us to distil ImageNet-1K using a +minimum of only 6.5GB of GPU memory. Under the optimal tradeoff strategy, it +requires only 2.5$\times$ less memory and 5$\times$ less runtime compared to +the state-of-the-art. 
Empirically, our method can produce synthetic datasets +just 10% the size of the original, yet achieve, on average, 94% of the test +accuracy of models trained on the full original datasets including ImageNet-1K, +significantly surpassing state-of-the-art. Additionally, extensive tests reveal +that our distilled datasets excel in cross-architecture generalization +capabilities.",cs.CV,['cs.CV'] +ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,Haokai Pang · Heming Zhu · Adam Kortylewski · Christian Theobalt · Marc Habermann,https://vcai.mpi-inf.mpg.de/projects/ash/,https://arxiv.org/abs/2312.05941,,2312.05941.pdf,ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,"Real-time rendering of photorealistic and controllable human avatars stands +as a cornerstone in Computer Vision and Graphics. While recent advances in +neural implicit rendering have unlocked unprecedented photorealism for digital +avatars, real-time performance has mostly been demonstrated for static scenes +only. To address this, we propose ASH, an animatable Gaussian splatting +approach for photorealistic rendering of dynamic humans in real-time. We +parameterize the clothed human as animatable 3D Gaussians, which can be +efficiently splatted into image space to generate the final rendering. However, +naively learning the Gaussian parameters in 3D space poses a severe challenge +in terms of compute. Instead, we attach the Gaussians onto a deformable +character model, and learn their parameters in 2D texture space, which allows +leveraging efficient 2D convolutional architectures that easily scale with the +required number of Gaussians. We benchmark ASH with competing methods on +pose-controllable avatars, demonstrating that our method outperforms existing +real-time methods by a large margin and shows comparable or even better results +than offline methods.",cs.CV,['cs.CV'] +Design2Cloth: 3D Cloth Generation from 2D Masks,Jiali Zheng · Rolandos Alexandros Potamias · Stefanos Zafeiriou, ,https://arxiv.org/abs/2404.02686,,2404.02686.pdf,Design2Cloth: 3D Cloth Generation from 2D Masks,"In recent years, there has been a significant shift in the field of digital +avatar research, towards modeling, animating and reconstructing clothed human +representations, as a key step towards creating realistic avatars. However, +current 3D cloth generation methods are garment specific or trained completely +on synthetic data, hence lacking fine details and realism. In this work, we +make a step towards automatic realistic garment design and propose +Design2Cloth, a high fidelity 3D generative model trained on a real world +dataset from more than 2000 subject scans. To provide vital contribution to the +fashion industry, we developed a user-friendly adversarial model capable of +generating diverse and detailed clothes simply by drawing a 2D cloth mask. +Under a series of both qualitative and quantitative experiments, we showcase +that Design2Cloth outperforms current state-of-the-art cloth generative models +by a large margin. In addition to the generative properties of our network, we +showcase that the proposed method can be used to achieve high quality +reconstructions from single in-the-wild images and 3D scans. 
Dataset, code and +pre-trained model will become publicly available.",cs.CV,['cs.CV'] +Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer,Wenqiao Zhang · Zheqi Lv, ,https://arxiv.org/abs/2311.12905,,2311.12905.pdf,Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer,"Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a +new target domain by actively selecting a limited number of target data to +annotate.This setting neglects the more practical scenario where training data +are collected from multiple sources. This motivates us to target a new and +challenging setting of knowledge transfer that extends ADA from a single source +domain to multiple source domains, termed Multi-source Active Domain Adaptation +(MADA). Not surprisingly, we find that most traditional ADA methods cannot work +directly in such a setting, mainly due to the excessive domain gap introduced +by all the source domains and thus their uncertainty-aware sample selection can +easily become miscalibrated under the multi-domain shifts. Considering this, we +propose a Dynamic integrated uncertainty valuation framework(Detective) that +comprehensively consider the domain shift between multi-source domains and +target domain to detect the informative target samples. Specifically, the +leverages a dynamic Domain Adaptation(DA) model that learns how to adapt the +model's parameters to fit the union of multi-source domains. This enables an +approximate single-source domain modeling by the dynamic model. We then +comprehensively measure both domain uncertainty and predictive uncertainty in +the target domain to detect informative target samples using evidential deep +learning, thereby mitigating uncertainty miscalibration. Furthermore, we +introduce a contextual diversity-aware calculator to enhance the diversity of +the selected samples. Experiments demonstrate that our solution outperforms +existing methods by a considerable margin on three domain adaptation +benchmarks.",cs.AI,"['cs.AI', 'cs.LG']" +Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing,Dongyoung Kim · Jinwoo Kim · Junsang Yu · Seon Joo Kim,https://www.dykim.me/projects/aid,https://arxiv.org/abs/2402.18277,,2402.18277.pdf,Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing,"White balance (WB) algorithms in many commercial cameras assume single and +uniform illumination, leading to undesirable results when multiple lighting +sources with different chromaticities exist in the scene. Prior research on +multi-illuminant WB typically predicts illumination at the pixel level without +fully grasping the scene's actual lighting conditions, including the number and +color of light sources. This often results in unnatural outcomes lacking in +overall consistency. To handle this problem, we present a deep white balancing +model that leverages the slot attention, where each slot is in charge of +representing individual illuminants. This design enables the model to generate +chromaticities and weight maps for individual illuminants, which are then fused +to compose the final illumination map. Furthermore, we propose the +centroid-matching loss, which regulates the activation of each slot based on +the color range, thereby enhancing the model to separate illumination more +effectively. 
Our method achieves the state-of-the-art performance on both +single- and multi-illuminant WB benchmarks, and also offers additional +information such as the number of illuminants in the scene and their +chromaticity. This capability allows for illumination editing, an application +not feasible with prior methods.",cs.CV,['cs.CV'] +Vista-LLaMA: Reliable Video Teller via Equal Distance to Visual Tokens,Fan Ma · Xiaojie Jin · Heng Wang · Yuchen Xian · Jiashi Feng · Yi Yang, ,https://arxiv.org/abs/2312.08870,,2312.08870.pdf,Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens,"Recent advances in large video-language models have displayed promising +outcomes in video comprehension. Current approaches straightforwardly convert +video into language tokens and employ large language models for multi-modal +tasks. However, this method often leads to the generation of irrelevant +content, commonly known as ""hallucination"", as the length of the text increases +and the impact of the video diminishes. To address this problem, we propose +Vista-LLaMA, a novel framework that maintains the consistent distance between +all visual tokens and any language tokens, irrespective of the generated text +length. Vista-LLaMA omits relative position encoding when determining attention +weights between visual and text tokens, retaining the position encoding for +text and text tokens. This amplifies the effect of visual tokens on text +generation, especially when the relative distance is longer between visual and +text tokens. The proposed attention mechanism significantly reduces the chance +of producing irrelevant text related to the video content. Furthermore, we +present a sequential visual projector that projects the current video frame +into tokens of language space with the assistance of the previous frame. This +approach not only captures the temporal relationship within the video, but also +allows less visual tokens to encompass the entire video. Our approach +significantly outperforms various previous methods (e.g., Video-ChatGPT, +MovieChat) on four challenging open-ended video question answering benchmarks. +We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot +MSRVTT-QA, setting a new state-of-the-art performance. This project is +available at https://jinxxian.github.io/Vista-LLaMA.",cs.CV,['cs.CV'] +Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images,JungEun Kim · Hangyul Yoon · Geondo Park · Kyungsu Kim · Eunho Yang, ,https://arxiv.org/abs/2404.01464,,2404.01464.pdf,Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images,"4D medical images, which represent 3D images with temporal information, are +crucial in clinical practice for capturing dynamic changes and monitoring +long-term disease progression. However, acquiring 4D medical images poses +challenges due to factors such as radiation exposure and imaging duration, +necessitating a balance between achieving high temporal resolution and +minimizing adverse effects. Given these circumstances, not only is data +acquisition challenging, but increasing the frame rate for each dataset also +proves difficult. To address this challenge, this paper proposes a simple yet +effective Unsupervised Volumetric Interpolation framework, UVI-Net. This +framework facilitates temporal interpolation without the need for any +intermediate frames, distinguishing it from the majority of other existing +unsupervised methods. 
Experiments on benchmark datasets demonstrate significant +improvements across diverse evaluation metrics compared to unsupervised and +supervised baselines. Remarkably, our approach achieves this superior +performance even when trained with a dataset as small as one, highlighting its +exceptional robustness and efficiency in scenarios with sparse supervision. +This positions UVI-Net as a compelling alternative for 4D medical imaging, +particularly in settings where data availability is limited. The source code is +available at https://github.com/jungeun122333/UVI-Net.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV', 'cs.LG']" +ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments,Jingyu Zhang · Kun Yang · Yilei Wang · Hanqi Wang · Peng Sun · Liang Song, ,https://arxiv.org/abs/2307.13929v3,,2307.13929v3.pdf,Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception,"Multi-agent collaborative perception as a potential application for +vehicle-to-everything communication could significantly improve the perception +performance of autonomous vehicles over single-agent perception. However, +several challenges remain in achieving pragmatic information sharing in this +emerging research. In this paper, we propose SCOPE, a novel collaborative +perception framework that aggregates the spatio-temporal awareness +characteristics across on-road agents in an end-to-end manner. Specifically, +SCOPE has three distinct strengths: i) it considers effective semantic cues of +the temporal context to enhance current representations of the target agent; +ii) it aggregates perceptually critical spatial information from heterogeneous +agents and overcomes localization errors via multi-scale feature interactions; +iii) it integrates multi-source representations of the target agent based on +their complementary contributions by an adaptive fusion paradigm. To thoroughly +evaluate SCOPE, we consider both real-world and simulated scenarios of +collaborative 3D object detection tasks on three datasets. Extensive +experiments demonstrate the superiority of our approach and the necessity of +the proposed components.",cs.CV,['cs.CV'] +HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding,Trong-Thuan Nguyen · Pha Nguyen · Khoa Luu,https://uark-cviu.github.io/ASPIRe/,https://arxiv.org/abs/2312.03050,,2312.03050.pdf,HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding,"Visual interactivity understanding within visual scenes presents a +significant challenge in computer vision. Existing methods focus on complex +interactivities while leveraging a simple relationship model. These methods, +however, struggle with a diversity of appearance, situation, position, +interaction, and relation in videos. This limitation hinders the ability to +fully comprehend the interplay within the complex visual dynamics of subjects. +In this paper, we delve into interactivities understanding within visual +content by deriving scene graph representations from dense interactivities +among humans and objects. To achieve this goal, we first present a new dataset +containing Appearance-Situation-Position-Interaction-Relation predicates, named +ASPIRe, offering an extensive collection of videos marked by a wide range of +interactivities. 
Then, we propose a new approach named Hierarchical +Interlacement Graph (HIG), which leverages a unified layer and graph within a +hierarchical structure to provide deep insights into scene changes across five +distinct tasks. Our approach demonstrates superior performance to other methods +through extensive experiments conducted in various scenarios.",cs.CV,['cs.CV'] +A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions,Jack Urbanek · Florian Bordes · Pietro Astolfi · Mary Williamson · Vasu Sharma · Adriana Romero-Soriano, ,https://arxiv.org/abs/2312.08578,,2312.08578.pdf,A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions,"Curation methods for massive vision-language datasets trade off between +dataset size and quality. However, even the highest quality of available +curated captions are far too short to capture the rich visual detail in an +image. To show the value of dense and highly-aligned image-text pairs, we +collect the Densely Captioned Images (DCI) dataset, containing 8012 natural +images human-annotated with mask-aligned descriptions averaging above 1000 +words each. With precise and reliable captions associated with specific parts +of an image, we can evaluate vision-language models' (VLMs) understanding of +image content with a novel task that matches each caption with its +corresponding subcrop. As current models are often limited to 77 text tokens, +we also introduce a summarized version (sDCI) in which each caption length is +limited. We show that modern techniques that make progress on standard +benchmarks do not correspond with significant improvement on our sDCI based +benchmark. Lastly, we finetune CLIP using sDCI and show significant +improvements over the baseline despite a small training set. By releasing the +first human annotated dense image captioning dataset, we hope to enable the +development of new benchmarks or fine-tuning recipes for the next generation of +VLMs to come.",cs.CV,['cs.CV'] +Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion,Lucas Nunes · Rodrigo Marcuzzi · Benedikt Mersch · Jens Behley · Cyrill Stachniss,https://github.com/PRBonn/LiDiff,https://arxiv.org/html/2403.13470v1,,2403.13470v1.pdf,Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion,"Computer vision techniques play a central role in the perception stack of +autonomous vehicles. Such methods are employed to perceive the vehicle +surroundings given sensor data. 3D LiDAR sensors are commonly used to collect +sparse 3D point clouds from the scene. However, compared to human perception, +such systems struggle to deduce the unseen parts of the scene given those +sparse point clouds. In this matter, the scene completion task aims at +predicting the gaps in the LiDAR measurements to achieve a more complete scene +representation. Given the promising results of recent diffusion models as +generative models for images, we propose extending them to achieve scene +completion from a single 3D LiDAR scan. Previous works used diffusion models +over range images extracted from LiDAR data, directly applying image-based +diffusion methods. Distinctly, we propose to directly operate on the points, +reformulating the noising and denoising diffusion process such that it can +efficiently work at scene scale. Together with our approach, we propose a +regularization loss to stabilize the noise predicted during the denoising +process. 
Our experimental evaluation shows that our method can complete the +scene given a single LiDAR scan as input, producing a scene with more details +compared to state-of-the-art scene completion methods. We believe that our +proposed diffusion process formulation can support further research in +diffusion models applied to scene-scale point cloud data.",cs.CV,['cs.CV'] +Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis,Willi Menapace · Aliaksandr Siarohin · Ivan Skorokhodov · Ekaterina Deyneka · Tsai-Shien Chen · Anil Kag · Yuwei Fang · Aleksei Stoliar · Elisa Ricci · Jian Ren · Sergey Tulyakov, ,https://arxiv.org/abs/2402.14797,,2402.14797.pdf,Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis,"Contemporary models for generating images show remarkable quality and +versatility. Swayed by these advantages, the research community repurposes them +to generate videos. Since video content is highly redundant, we argue that +naively bringing advances of image models to the video generation domain +reduces motion fidelity, visual quality and impairs scalability. In this work, +we build Snap Video, a video-first model that systematically addresses these +challenges. To do that, we first extend the EDM framework to take into account +spatially and temporally redundant pixels and naturally support video +generation. Second, we show that a U-Net - a workhorse behind image generation +- scales poorly when generating videos, requiring significant computational +overhead. Hence, we propose a new transformer-based architecture that trains +3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us +to efficiently train a text-to-video model with billions of parameters for the +first time, reach state-of-the-art results on a number of benchmarks, and +generate videos with substantially higher quality, temporal consistency, and +motion complexity. The user studies showed that our model was favored by a +large margin over the most recent methods. See our website at +https://snap-research.github.io/snapvideo/.",cs.CV,"['cs.CV', 'cs.AI']" +ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF,Han Ling · Quansen Sun · Yinghui Sun · Xian Xu · Xingfeng Li, ,https://arxiv.org/abs/2311.04246,,2311.04246.pdf,ADFactory: An Effective Framework for Generalizing Optical Flow with Nerf,"A significant challenge facing current optical flow methods is the difficulty +in generalizing them well to the real world. This is mainly due to the high +cost of hand-crafted datasets, and existing self-supervised methods are limited +by indirect loss and occlusions, resulting in fuzzy outcomes. To address this +challenge, we introduce a novel optical flow training framework: automatic data +factory (ADF). ADF only requires RGB images as input to effectively train the +optical flow network on the target data domain. Specifically, we use advanced +Nerf technology to reconstruct scenes from photo groups collected by a +monocular camera, and then calculate optical flow labels between camera pose +pairs based on the rendering results. To eliminate erroneous labels caused by +defects in the scene reconstructed by Nerf, we screened the generated labels +from multiple aspects, such as optical flow matching accuracy, radiation field +confidence, and depth consistency. The filtered labels can be directly used for +network supervision. 
Experimentally, the generalization ability of ADF on KITTI +surpasses existing self-supervised optical flow and monocular scene flow +algorithms. In addition, ADF achieves impressive results in real-world +zero-point generalization evaluations and surpasses most supervised methods.",cs.CV,['cs.CV'] +Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection,Chengjie Wang · wenbing zhu · Bin-Bin Gao · Zhenye Gan · Jiangning Zhang · Zhihao Gu · Bruce Qian · Mingang Chen · Lizhuang Ma, ,https://arxiv.org/abs/2403.12580,,2403.12580.pdf,Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection,"Industrial anomaly detection (IAD) has garnered significant attention and +experienced rapid development. However, the recent development of IAD approach +has encountered certain difficulties due to dataset limitations. On the one +hand, most of the state-of-the-art methods have achieved saturation (over 99% +in AUROC) on mainstream datasets such as MVTec, and the differences of methods +cannot be well distinguished, leading to a significant gap between public +datasets and actual application scenarios. On the other hand, the research on +various new practical anomaly detection settings is limited by the scale of the +dataset, posing a risk of overfitting in evaluation results. Therefore, we +propose a large-scale, Real-world, and multi-view Industrial Anomaly Detection +dataset, named Real-IAD, which contains 150K high-resolution images of 30 +different objects, an order of magnitude larger than existing datasets. It has +a larger range of defect area and ratio proportions, making it more challenging +than previous datasets. To make the dataset closer to real application +scenarios, we adopted a multi-view shooting method and proposed sample-level +evaluation metrics. In addition, beyond the general unsupervised anomaly +detection setting, we propose a new setting for Fully Unsupervised Industrial +Anomaly Detection (FUIAD) based on the observation that the yield rate in +industrial production is usually greater than 60%, which has more practical +application value. Finally, we report the results of popular IAD methods on the +Real-IAD dataset, providing a highly challenging benchmark to promote the +development of the IAD field.",cs.CV,['cs.CV'] +Multiview Aerial Visual RECognition (MAVREC) Dataset: Can Multi-view Improve Aerial Visual Perception?,Aritra Dutta · Srijan Das · Jacob Nielsen · RAJATSUBHRA CHAKRABORTY · Mubarak Shah, ,https://arxiv.org/abs/2312.04548,,2312.04548.pdf,Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?,"Despite the commercial abundance of UAVs, aerial data acquisition remains +challenging, and the existing Asia and North America-centric open-source UAV +datasets are small-scale or low-resolution and lack diversity in scene +contextuality. Additionally, the color content of the scenes, solar-zenith +angle, and population density of different geographies influence the data +diversity. These two factors conjointly render suboptimal aerial-visual +perception of the deep neural network (DNN) models trained primarily on the +ground-view data, including the open-world foundational models. + To pave the way for a transformative era of aerial detection, we present +Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record +synchronized scenes from different perspectives -- ground camera and +drone-mounted camera. 
MAVREC consists of around 2.5 hours of industry-standard +2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million +annotated bounding boxes. This makes MAVREC the largest ground and aerial-view +dataset, and the fourth largest among all drone-based datasets across all +modalities and tasks. Through our extensive benchmarking on MAVREC, we +recognize that augmenting object detectors with ground-view images from the +corresponding geographical location is a superior pre-training strategy for +aerial detection. Building on this strategy, we benchmark MAVREC with a +curriculum-based semi-supervised object detection approach that leverages +labeled (ground and aerial) and unlabeled (only aerial) images to enhance the +aerial detection. We publicly release the MAVREC dataset: +https://mavrec.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.0; I.4.8; I.5.1; I.5.4; I.2.10']" +SpatialTracker: Tracking Any 2D Pixels in 3D Space,Yuxi Xiao · Qianqian Wang · Shangzhan Zhang · Nan Xue · Sida Peng · Yujun Shen · Xiaowei Zhou, ,https://arxiv.org/abs/2404.04319,,2404.04319.pdf,SpatialTracker: Tracking Any 2D Pixels in 3D Space,"Recovering dense and long-range pixel motion in videos is a challenging +problem. Part of the difficulty arises from the 3D-to-2D projection process, +leading to occlusions and discontinuities in the 2D motion domain. While 2D +motion can be intricate, we posit that the underlying 3D motion can often be +simple and low-dimensional. In this work, we propose to estimate point +trajectories in 3D space to mitigate the issues caused by image projection. Our +method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth +estimators, represents the 3D content of each frame efficiently using a +triplane representation, and performs iterative updates using a transformer to +estimate 3D trajectories. Tracking in 3D allows us to leverage +as-rigid-as-possible (ARAP) constraints while simultaneously learning a +rigidity embedding that clusters pixels into different rigid parts. Extensive +evaluation shows that our approach achieves state-of-the-art tracking +performance both qualitatively and quantitatively, particularly in challenging +scenarios such as out-of-plane rotation.",cs.CV,['cs.CV'] +SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective,Yu-Bang Zheng · Xile Zhao · Junhua Zeng · Chao Li · Qibin Zhao · Heng-Chao Li · Ting-Zhu Huang,https://yubangzheng.github.io,,https://zhaoxile.github.io/index.html,,,,,nan +LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,Gongwei Chen · Leyang Shen · Rui Shao · Xiang Deng · Liqiang Nie, ,https://arxiv.org/abs/2311.11860,,2311.11860.pdf,LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,"Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability +to perceive and understand multi-modal signals. However, most of the existing +MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text +pairs, leading to insufficient extraction and reasoning of visual knowledge. To +address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal +Large Language Model (LION), which empowers the MLLM by injecting visual +knowledge in two levels. 1) Progressive incorporation of fine-grained +spatial-aware visual knowledge. 
We design a vision aggregator cooperated with +region-level vision-language (VL) tasks to incorporate fine-grained +spatial-aware visual knowledge into the MLLM. To alleviate the conflict between +image-level and region-level VL tasks during incorporation, we devise a +dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This +progressive incorporation scheme contributes to the mutual promotion between +these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual +evidence. We facilitate the MLLM with high-level semantic visual evidence by +leveraging diverse image tags. To mitigate the potential influence caused by +imperfect predicted tags, we propose a soft prompting method by embedding a +learnable token into the tailored text instruction. Comprehensive experiments +on several multi-modal benchmarks demonstrate the superiority of our model +(e.g., improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over +InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).",cs.CV,['cs.CV'] +LEAD: Learning Decomposition for Source-free Universal Domain Adaptation,Sanqing Qu · Tianpei Zou · Lianghua He · Florian Röhrbein · Alois Knoll · Guang Chen · Changjun Jiang,https://github.com/ispc-lab/LEAD,https://arxiv.org/abs/2403.03421,,2403.03421.pdf,LEAD: Learning Decomposition for Source-free Universal Domain Adaptation,"Universal Domain Adaptation (UniDA) targets knowledge transfer in the +presence of both covariate and label shifts. Recently, Source-free Universal +Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to +source data, which tends to be more practical due to data protection policies. +The main challenge lies in determining whether covariate-shifted samples belong +to target-private unknown categories. Existing methods tackle this either +through hand-crafted thresholding or by developing time-consuming iterative +clustering strategies. In this paper, we propose a new idea of LEArning +Decomposition (LEAD), which decouples features into source-known and -unknown +components to identify target-private data. Technically, LEAD initially +leverages the orthogonal decomposition analysis for feature decomposition. +Then, LEAD builds instance-level decision boundaries to adaptively identify +target-private data. Extensive experiments across various UniDA scenarios have +demonstrated the effectiveness and superiority of LEAD. Notably, in the OPDA +scenario on VisDA dataset, LEAD outperforms GLC by 3.5% overall H-score and +reduces 75% time to derive pseudo-labeling decision boundaries. Besides, LEAD +is also appealing in that it is complementary to most existing methods. The +code is available at https://github.com/ispc-lab/LEAD.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations,Evonne Ng · Javier Romero · Timur Bagautdinov · Shaojie Bai · Trevor Darrell · Angjoo Kanazawa · Alexander Richard, ,https://arxiv.org/abs/2401.01885,,2401.01885.pdf,From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations,"We present a framework for generating full-bodied photorealistic avatars that +gesture according to the conversational dynamics of a dyadic interaction. Given +speech audio, we output multiple possibilities of gestural motion for an +individual, including face, body, and hands. 
The key behind our method is in +combining the benefits of sample diversity from vector quantization with the +high-frequency details obtained through diffusion to generate more dynamic, +expressive motion. We visualize the generated motion using highly +photorealistic avatars that can express crucial nuances in gestures (e.g. +sneers and smirks). To facilitate this line of research, we introduce a +first-of-its-kind multi-view conversational dataset that allows for +photorealistic reconstruction. Experiments show our model generates appropriate +and diverse gestures, outperforming both diffusion- and VQ-only methods. +Furthermore, our perceptual evaluation highlights the importance of +photorealism (vs. meshes) in accurately assessing subtle motion details in +conversational gestures. Code and dataset available online.",cs.CV,['cs.CV'] +Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture,Fei Wang · Dan Guo · Kun Li · Zhun Zhong · Meng Wang, ,https://arxiv.org/abs/2403.07347,,2403.07347.pdf,Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture,"Video Motion Magnification (VMM) aims to reveal subtle and imperceptible +motion information of objects in the macroscopic world. Prior methods directly +model the motion field from the Eulerian perspective by Representation Learning +that separates shape and texture or Multi-domain Learning from phase +fluctuations. Inspired by the frequency spectrum, we observe that the +low-frequency components with stable energy always possess spatial structure +and less noise, making them suitable for modeling the subtle motion field. To +this end, we present FD4MM, a new paradigm of Frequency Decoupling for Motion +Magnification with a Multi-level Isomorphic Architecture to capture multi-level +high-frequency details and a stable low-frequency structure (motion field) in +video space. Since high-frequency details and subtle motions are susceptible to +information degradation due to their inherent subtlety and unavoidable external +interference from noise, we carefully design Sparse High/Low-pass Filters to +enhance the integrity of details and motion structures, and a Sparse Frequency +Mixer to promote seamless recoupling. Besides, we innovatively design a +contrastive regularization for this task to strengthen the model's ability to +discriminate irrelevant features, reducing undesired motion magnification. +Extensive experiments on both Real-world and Synthetic Datasets show that our +FD4MM outperforms SOTA methods. Meanwhile, FD4MM reduces FLOPs by 1.63$\times$ +and boosts inference speed by 1.68$\times$ than the latest method. Our code is +available at https://github.com/Jiafei127/FD4MM.",cs.CV,['cs.CV'] +LLM-AR: When Large Language Model Meets Skeleton-Based Action Recognition,Haoxuan Qu · Yujun Cai · Jun Liu, ,https://arxiv.org/abs/2404.00532,,2404.00532.pdf,LLMs are Good Action Recognizers,"Skeleton-based action recognition has attracted lots of research attention. +Recently, to build an accurate skeleton-based action recognizer, a variety of +works have been proposed. Among them, some works use large model architectures +as backbones of their recognizers to boost the skeleton data representation +capability, while some other works pre-train their recognizers on external data +to enrich the knowledge. 
In this work, we observe that large language models +which have been extensively used in various natural language processing tasks +generally hold both large model architectures and rich implicit knowledge. +Motivated by this, we propose a novel LLM-AR framework, in which we investigate +treating the Large Language Model as an Action Recognizer. In our framework, we +propose a linguistic projection process to project each input action signal +(i.e., each skeleton sequence) into its ``sentence format'' (i.e., an ``action +sentence''). Moreover, we also incorporate our framework with several designs +to further facilitate this linguistic projection process. Extensive experiments +demonstrate the efficacy of our proposed framework.",cs.CV,['cs.CV'] +Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis,Xin Zhou · Dingkang Liang · Wei Xu · Xingkui Zhu · Yihan Xu · Zhikang Zou · Xiang Bai, ,https://arxiv.org/abs/2403.01439,,2403.01439.pdf,Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis,"Point cloud analysis has achieved outstanding performance by transferring +point cloud pre-trained models. However, existing methods for model adaptation +usually update all model parameters, i.e., full fine-tuning paradigm, which is +inefficient as it relies on high computational costs (e.g., training GPU +memory) and massive storage space. In this paper, we aim to study +parameter-efficient transfer learning for point cloud analysis with an ideal +trade-off between task performance and parameter efficiency. To achieve this +goal, we freeze the parameters of the default pre-trained models and then +propose the Dynamic Adapter, which generates a dynamic scale for each token, +considering the token significance to the downstream task. We further +seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing +Internal Prompts, capturing the instance-specific features for interaction. +Extensive experiments conducted on five challenging datasets demonstrate that +the proposed DAPT achieves superior performance compared to the full +fine-tuning counterparts while significantly reducing the trainable parameters +and training GPU memory by 95% and 35%, respectively. Code is available at +https://github.com/LMD0311/DAPT.",cs.CV,['cs.CV'] +Link-Context Learning for Multimodal LLMs,Yan Tai · Weichen Fan · Zhao Zhang · Ziwei Liu, ,https://arxiv.org/abs/2308.07891,,2308.07891.pdf,Link-Context Learning for Multimodal LLMs,"The ability to learn from context with novel concepts, and deliver +appropriate responses are essential in human conversations. Despite current +Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being +trained on mega-scale datasets, recognizing unseen images or understanding +novel concepts in a training-free manner remains a challenge. In-Context +Learning (ICL) explores training-free few-shot learning, where models are +encouraged to ``learn to learn"" from limited tasks and generalize to unseen +tasks. In this work, we propose link-context learning (LCL), which emphasizes +""reasoning from cause and effect"" to augment the learning capabilities of +MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal +relationship between the support set and the query set. 
By providing +demonstrations with causal links, LCL guides the model to discern not only the +analogy but also the underlying causal associations between data points, which +empowers MLLMs to recognize unseen images and understand novel concepts more +effectively. To facilitate the evaluation of this novel approach, we introduce +the ISEKAI dataset, comprising exclusively of unseen generated image-label +pairs designed for link-context learning. Extensive experiments show that our +LCL-MLLM exhibits strong link-context learning capabilities to novel concepts +over vanilla MLLMs. Code and data will be released at +https://github.com/isekai-portal/Link-Context-Learning.",cs.CV,"['cs.CV', 'cs.CL']" +Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models,Jingyao Xu · Yuetong Lu · Yandong Li · Siyang Lu · Dongdong Wang · Xiang Wei, ,https://arxiv.org/abs/2404.15081,,2404.15081.pdf,Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models,"Diffusion models (DMs) embark a new era of generative modeling and offer more +opportunities for efficient generating high-quality and realistic data samples. +However, their widespread use has also brought forth new challenges in model +security, which motivates the creation of more effective adversarial attackers +on DMs to understand its vulnerability. We propose CAAT, a simple but generic +and efficient approach that does not require costly training to effectively +fool latent diffusion models (LDMs). The approach is based on the observation +that cross-attention layers exhibits higher sensitivity to gradient change, +allowing for leveraging subtle perturbations on published images to +significantly corrupt the generated images. We show that a subtle perturbation +on an image can significantly impact the cross-attention layers, thus changing +the mapping between text and image during the fine-tuning of customized +diffusion models. Extensive experiments demonstrate that CAAT is compatible +with diverse diffusion models and outperforms baseline attack methods in a more +effective (more noise) and efficient (twice as fast as Anti-DreamBooth and +Mist) manner.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Robust Depth Enhancement via Polarization Prompt Fusion Tuning,Kei IKEMURA · Yiming Huang · Felix Heide · Zhaoxiang Zhang · Qifeng Chen · Chenyang Lei,https://lastbasket.github.io/PPFT/,https://arxiv.org/abs/2404.04318,,2404.04318.pdf,Robust Depth Enhancement via Polarization Prompt Fusion Tuning,"Existing depth sensors are imperfect and may provide inaccurate depth values +in challenging scenarios, such as in the presence of transparent or reflective +objects. In this work, we present a general framework that leverages +polarization imaging to improve inaccurate depth measurements from various +depth sensors. Previous polarization-based depth enhancement methods focus on +utilizing pure physics-based formulas for a single sensor. In contrast, our +method first adopts a learning-based strategy where a neural network is trained +to estimate a dense and complete depth map from polarization data and a sensor +depth map from different sensors. To further improve the performance, we +propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively +utilize RGB-based models pre-trained on large-scale datasets, as the size of +the polarization dataset is limited to train a strong model from scratch. 
We +conducted extensive experiments on a public dataset, and the results +demonstrate that the proposed method performs favorably compared to existing +depth enhancement baselines. Code and demos are available at +https://lastbasket.github.io/PPFT/.",cs.CV,"['cs.CV', 'cs.AI']" +Shadows Don’t Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now,Ayush Sarkar · Hanlin Mai · Amitabh Mahapatra · David Forsyth · Svetlana Lazebnik · Anand Bhattad,https://projective-geometry.github.io,https://arxiv.org/abs/2311.17138,,2311.17138.pdf,Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now,"Generative models can produce impressively realistic images. This paper +demonstrates that generated images have geometric features different from those +of real images. We build a set of collections of generated images, prequalified +to fool simple, signal-based classifiers into believing they are real. We then +show that prequalified generated images can be identified reliably by +classifiers that only look at geometric properties. We use three such +classifiers. All three classifiers are denied access to image pixels, and look +only at derived geometric features. The first classifier looks at the +perspective field of the image, the second looks at lines detected in the +image, and the third looks at relations between detected objects and shadows. +Our procedure detects generated images more reliably than SOTA local signal +based detectors, for images from a number of distinct generators. Saliency maps +suggest that the classifiers can identify geometric problems reliably. We +conclude that current generators cannot reliably reproduce geometric properties +of real images.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image,Chong Bao · Yinda Zhang · Yuan Li · Xiyu Zhang · Bangbang Yang · Hujun Bao · Marc Pollefeys · Guofeng Zhang · Zhaopeng Cui,https://zju3dv.github.io/geneavatar/,https://arxiv.org/abs/2404.02152,,2404.02152.pdf,GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image,"Recently, we have witnessed the explosive growth of various volumetric +representations in modeling animatable head avatars. However, due to the +diversity of frameworks, there is no practical method to support high-level +applications like 3D head avatar editing across different representations. In +this paper, we propose a generic avatar editing approach that can be +universally applied to various 3DMM driving volumetric head avatars. To achieve +this goal, we design a novel expression-aware modification generative model, +which enables lift 2D editing from a single image to a consistent 3D +modification field. To ensure the effectiveness of the generative modification +process, we develop several techniques, including an expression-dependent +modification distillation scheme to draw knowledge from the large-scale head +avatar model and 2D facial texture editing tools, implicit latent space +guidance to enhance model convergence, and a segmentation-based loss reweight +strategy for fine-grained texture inversion. Extensive experiments demonstrate +that our method delivers high-quality and consistent results across multiple +expression and viewpoints. 
Project page: https://zju3dv.github.io/geneavatar/",cs.CV,['cs.CV'] +MarkovGen: Structured Prediction for Efficient Text-to-Image Generation,Sadeep Jayasumana · Daniel Glasner · Srikumar Ramalingam · Andreas Veit · Ayan Chakrabarti · Sanjiv Kumar, ,https://arxiv.org/abs/2308.10997,,2308.10997.pdf,MarkovGen: Structured Prediction for Efficient Text-to-Image Generation,"Modern text-to-image generation models produce high-quality images that are +both photorealistic and faithful to the text prompts. However, this quality +comes at significant computational cost: nearly all of these models are +iterative and require running sampling multiple times with large models. This +iterative process is needed to ensure that different regions of the image are +not only aligned with the text prompt, but also compatible with each other. In +this work, we propose a light-weight approach to achieving this compatibility +between different regions of an image, using a Markov Random Field (MRF) model. +We demonstrate the effectiveness of this method on top of the latent +token-based Muse text-to-image model. The MRF richly encodes the compatibility +among image tokens at different spatial locations to improve quality and +significantly reduce the required number of Muse sampling steps. Inference with +the MRF is significantly cheaper, and its parameters can be quickly learned +through back-propagation by modeling MRF inference as a differentiable +neural-network layer. Our full model, MarkovGen, uses this proposed MRF model +to both speed up Muse by 1.5X and produce higher quality images by decreasing +undesirable image artifacts.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +LAENeRF: Local Appearance Editing for Neural Radiance Fields,Lukas Radl · Michael Steiner · Andreas Kurz · Markus Steinberger,https://r4dl.github.io/LAENeRF/,https://arxiv.org/abs/2312.09913,,2312.09913.pdf,LAENeRF: Local Appearance Editing for Neural Radiance Fields,"Due to the omnipresence of Neural Radiance Fields (NeRFs), the interest +towards editable implicit 3D representations has surged over the last years. +However, editing implicit or hybrid representations as used for NeRFs is +difficult due to the entanglement of appearance and geometry encoded in the +model parameters. Despite these challenges, recent research has shown first +promising steps towards photorealistic and non-photorealistic appearance edits. +The main open issues of related work include limited interactivity, a lack of +support for local edits and large memory requirements, rendering them less +useful in practice. We address these limitations with LAENeRF, a unified +framework for photorealistic and non-photorealistic appearance editing of +NeRFs. To tackle local editing, we leverage a voxel grid as starting point for +region selection. We learn a mapping from expected ray terminations to final +output color, which can optionally be supervised by a style loss, resulting in +a framework which can perform photorealistic and non-photorealistic appearance +editing of selected regions. Relying on a single point per ray for our mapping, +we limit memory requirements and enable fast optimization. To guarantee +interactivity, we compose the output color using a set of learned, modifiable +base colors, composed with additive layer mixing. Compared to concurrent work, +LAENeRF enables recoloring and stylization while keeping processing time low. 
+Furthermore, we demonstrate that our approach surpasses baseline methods both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +EgoGen: An Egocentric Synthetic Data Generator,Gen Li · Kaifeng Zhao · Siwei Zhang · Xiaozhong Lyu · Mihai Dusmanu · Yan Zhang · Marc Pollefeys · Siyu Tang,https://ego-gen.github.io,https://arxiv.org/abs/2401.08739,,2401.08739.pdf,EgoGen: An Egocentric Synthetic Data Generator,"Understanding the world in first-person view is fundamental in Augmented +Reality (AR). This immersive perspective brings dramatic visual changes and +unique challenges compared to third-person views. Synthetic data has empowered +third-person-view vision models, but its application to embodied egocentric +perception tasks remains largely unexplored. A critical challenge lies in +simulating natural human movements and behaviors that effectively steer the +embodied cameras to capture a faithful egocentric representation of the 3D +world. To address this challenge, we introduce EgoGen, a new synthetic data +generator that can produce accurate and rich ground-truth training data for +egocentric perception tasks. At the heart of EgoGen is a novel human motion +synthesis model that directly leverages egocentric visual inputs of a virtual +human to sense the 3D environment. Combined with collision-avoiding motion +primitives and a two-stage reinforcement learning approach, our motion +synthesis model offers a closed-loop solution where the embodied perception and +movement of the virtual human are seamlessly coupled. Compared to previous +works, our model eliminates the need for a pre-defined global path, and is +directly applicable to dynamic environments. Combined with our easy-to-use and +scalable data generation pipeline, we demonstrate EgoGen's efficacy in three +tasks: mapping and localization for head-mounted cameras, egocentric camera +tracking, and human mesh recovery from egocentric views. EgoGen will be fully +open-sourced, offering a practical solution for creating realistic egocentric +training data and aiming to serve as a useful tool for egocentric computer +vision research. Refer to our project page: https://ego-gen.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection,Dinh Phat Do · Taehoon Kim · JAEMIN NA · Jiwon Kim · Keonho LEE · Kyunghwan Cho · Wonjun Hwang,https://github.com/EdwardDo69/D3T,https://arxiv.org/abs/2403.09359,,2403.09359.pdf,D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection,"Domain adaptation for object detection typically entails transferring +knowledge from one visible domain to another visible domain. However, there are +limited studies on adapting from the visible to the thermal domain, because the +domain gap between the visible and thermal domains is much larger than +expected, and traditional domain adaptation can not successfully facilitate +learning in this situation. To overcome this challenge, we propose a +Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training +paradigms for each domain. Specifically, we segregate the source and target +training sets for building dual-teachers and successively deploy exponential +moving average to the student model to individual teachers of each domain. The +framework further incorporates a zigzag learning method between dual teachers, +facilitating a gradual transition from the visible to thermal domains during +training. 
We validate the superiority of our method through newly designed +experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. +Source code is available at https://github.com/EdwardDo69/D3T .",cs.CV,"['cs.CV', 'cs.AI']" +Bayesian Diffusion Models for 3D Shape Reconstruction,Haiyang Xu · Yu lei · Zeyuan Chen · Xiang Zhang · Yue Zhao · Yilin Wang · Zhuowen Tu, ,https://arxiv.org/abs/2403.06973,,2403.06973.pdf,Bayesian Diffusion Models for 3D Shape Reconstruction,"We present Bayesian Diffusion Models (BDM), a prediction algorithm that +performs effective Bayesian inference by tightly coupling the top-down (prior) +information with the bottom-up (data-driven) procedure via joint diffusion +processes. We show the effectiveness of BDM on the 3D shape reconstruction +task. Compared to prototypical deep learning data-driven approaches trained on +paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM +brings in rich prior information from standalone labels (e.g. point clouds) to +improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian +frameworks where explicit prior and likelihood are required for the inference, +BDM performs seamless information fusion via coupled diffusion processes with +learned gradient computation networks. The specialty of our BDM lies in its +capability to engage the active and effective information exchange and fusion +of the top-down and bottom-up processes where each itself is a diffusion +process. We demonstrate state-of-the-art results on both synthetic and +real-world benchmarks for 3D shape reconstruction.",cs.CV,"['cs.CV', 'cs.LG']" +Domain Separation Graph Neural Networks for Saliency Object Ranking,Zijian Wu · Jun Lu · Jing Han · Lianfa Bai · Yi Zhang · Zhuang Zhao · Siyang Song, ,,https://www.nature.com/articles/s41598-024-61105-3,,,,,nan +DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision,Lu Ling · Yichen Sheng · Zhi Tu · Wentian Zhao · Cheng Xin · Kun Wan · Lantao Yu · Qianyu Guo · Zixun Yu · Yawen Lu · Xuanmao Li · Xingpeng Sun · Rohan Ashok · Aniruddha Mukherjee · Hao Kang · Xiangrui Kong · Gang Hua · Tianyi Zhang · Bedrich Benes · Aniket Bera, ,https://arxiv.org/abs/2312.16256,,2312.16256.pdf,DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision,"We have witnessed significant progress in deep learning-based 3D vision, +ranging from neural radiance field (NeRF) based 3D representation learning to +applications in novel view synthesis (NVS). However, existing scene-level +datasets for deep learning-based 3D vision, limited to either synthetic +environments or a narrow selection of real-world scenes, are quite +insufficient. This insufficiency not only hinders a comprehensive benchmark of +existing methods but also caps what could be explored in deep learning-based 3D +analysis. To address this critical gap, we present DL3DV-10K, a large-scale +scene dataset, featuring 51.2 million frames from 10,510 videos captured from +65 types of point-of-interest (POI) locations, covering both bounded and +unbounded scenes, with different levels of reflection, transparency, and +lighting. We conducted a comprehensive benchmark of recent NVS methods on +DL3DV-10K, which revealed valuable insights for future research in NVS. 
In +addition, we have obtained encouraging results in a pilot study to learn +generalizable NeRF from DL3DV-10K, which manifests the necessity of a +large-scale scene-level dataset to forge a path toward a foundation model for +learning 3D representation. Our DL3DV-10K dataset, benchmark results, and +models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.",cs.CV,"['cs.CV', 'cs.AI']" +Fitting Flats to Flats,Gabriel Dogadov · Ugo Finnendahl · Marc Alexa, ,,https://github.com/gdogadov,,,,,nan +MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild,Zeren Jiang · Chen Guo · Manuel Kaufmann · Tianjian Jiang · Julien Valentin · Otmar Hilliges · Jie Song, ,,https://dl.acm.org/doi/10.1145/3581783.3611978,,,,,nan +Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving,Yuqi Wang · Jiawei He · Lue Fan · Hongxin Li · Yuntao Chen · Zhaoxiang Zhang,https://drive-wm.github.io,https://arxiv.org/abs/2311.17918,,2311.17918.pdf,Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving,"In autonomous driving, predicting future events in advance and evaluating the +foreseeable risks empowers autonomous vehicles to better plan their actions, +enhancing safety and efficiency on the road. To this end, we propose Drive-WM, +the first driving world model compatible with existing end-to-end planning +models. Through a joint spatial-temporal modeling facilitated by view +factorization, our model generates high-fidelity multiview videos in driving +scenes. Building on its powerful generation ability, we showcase the potential +of applying the world model for safe driving planning for the first time. +Particularly, our Drive-WM enables driving into multiple futures based on +distinct driving maneuvers, and determines the optimal trajectory according to +the image-based rewards. Evaluation on real-world driving datasets verifies +that our method could generate high-quality, consistent, and controllable +multiview videos, opening up possibilities for real-world simulations and safe +planning.",cs.CV,['cs.CV'] +DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving,Chen Min · Dawei Zhao · Liang Xiao · Jian Zhao · Xinli Xu · Zheng Zhu · Lei Jin · Jianshu Li · Yulan Guo · Junliang Xing · Liping Jing · Yiming Nie · Bin Dai, ,,https://paperswithcode.com/paper/driveworld-4d-pre-trained-scene-understanding,,,,,nan +ZONE: Zero-Shot Instruction-Guided Local Editing,Shanglin Li · Bohan Zeng · Yutang Feng · Sicheng Gao · Xuhui Liu · Jiaming Liu · Li Lin · Xu Tang · Yao Hu · Jianzhuang Liu · Baochang Zhang, ,https://arxiv.org/abs/2312.16794,,2312.16794.pdf,ZONE: Zero-Shot Instruction-Guided Local Editing,"Recent advances in vision-language models like Stable Diffusion have shown +remarkable power in creative image synthesis and editing.However, most existing +text-to-image editing methods encounter two obstacles: First, the text prompt +needs to be carefully crafted to achieve good results, which is not intuitive +or user-friendly. Second, they are insensitive to local edits and can +irreversibly affect non-edited regions, leaving obvious editing traces. To +tackle these problems, we propose a Zero-shot instructiON-guided local image +Editing approach, termed ZONE. We first convert the editing intent from the +user-provided instruction (e.g., ""make his tie blue"") into specific image +editing regions through InstructPix2Pix. 
We then propose a Region-IoU scheme +for precise image layer extraction from an off-the-shelf segment model. We +further develop an edge smoother based on FFT for seamless blending between the +layer and the image.Our method allows for arbitrary manipulation of a specific +region with a single instruction while preserving the rest. Extensive +experiments demonstrate that our ZONE achieves remarkable local editing results +and user-friendliness, outperforming state-of-the-art methods. Code is +available at https://github.com/lsl001006/ZONE.",cs.CV,['cs.CV'] +DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing,Kaiwen Zhang · Yifan Zhou · Xudong XU · Bo Dai · Xingang Pan,https://kevin-thu.github.io/DiffMorpher_page,https://arxiv.org/abs/2312.07409,,2312.07409.pdf,DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing,"Diffusion models have achieved remarkable image generation quality surpassing +previous generative models. However, a notable limitation of diffusion models, +in comparison to GANs, is their difficulty in smoothly interpolating between +two image samples, due to their highly unstructured latent space. Such a smooth +interpolation is intriguing as it naturally serves as a solution for the image +morphing task with many applications. In this work, we present DiffMorpher, the +first approach enabling smooth and natural image interpolation using diffusion +models. Our key idea is to capture the semantics of the two images by fitting +two LoRAs to them respectively, and interpolate between both the LoRA +parameters and the latent noises to ensure a smooth semantic transition, where +correspondence automatically emerges without the need for annotation. In +addition, we propose an attention interpolation and injection technique and a +new sampling schedule to further enhance the smoothness between consecutive +images. Extensive experiments demonstrate that DiffMorpher achieves starkly +better image morphing effects than previous methods across a variety of object +categories, bridging a critical functional gap that distinguished diffusion +models from GANs.",cs.CV,['cs.CV'] +InstructDiffusion: A Generalist Modeling Interface for Vision Tasks,Zigang Geng · Binxin Yang · Tiankai Hang · Chen Li · Shuyang Gu · Ting Zhang · Jianmin Bao · Zheng Zhang · Houqiang Li · Han Hu · Dong Chen · Baining Guo, ,https://arxiv.org/abs/2309.03895,,2309.03895.pdf,InstructDiffusion: A Generalist Modeling Interface for Vision Tasks,"We present InstructDiffusion, a unifying and generic framework for aligning +computer vision tasks with human instructions. Unlike existing approaches that +integrate prior knowledge and pre-define the output space (e.g., categories and +coordinates) for each vision task, we cast diverse vision tasks into a +human-intuitive image-manipulating process whose output space is a flexible and +interactive pixel space. Concretely, the model is built upon the diffusion +process and is trained to predict pixels according to user instructions, such +as encircling the man's left shoulder in red or applying a blue mask to the +left car. InstructDiffusion could handle a variety of vision tasks, including +understanding tasks (such as segmentation and keypoint detection) and +generative tasks (such as editing and enhancement). It even exhibits the +ability to handle unseen tasks and outperforms prior methods on novel datasets. 
+This represents a significant step towards a generalist modeling interface for +vision tasks, advancing artificial general intelligence in the field of +computer vision.",cs.CV,['cs.CV'] +Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket,Chengxu Zuo · Yiming Wang · Lishuang Zhan · Shihui Guo · Xinyu Yi · Feng Xu · Yipeng Qin, ,https://arxiv.org/abs/2308.16682,,2308.16682.pdf,DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,"Motion capture from a limited number of body-worn sensors, such as inertial +measurement units (IMUs) and pressure insoles, has important applications in +health, human performance, and entertainment. Recent work has focused on +accurately reconstructing whole-body motion from a specific sensor +configuration using six IMUs. While a common goal across applications is to use +the minimal number of sensors to achieve required accuracy, the optimal +arrangement of the sensors might differ from application to application. We +propose a single diffusion model, DiffusionPoser, which reconstructs human +motion in real-time from an arbitrary combination of sensors, including IMUs +placed at specified locations, and, pressure insoles. Unlike existing methods, +our model grants users the flexibility to determine the number and arrangement +of sensors tailored to the specific activity of interest, without the need for +retraining. A novel autoregressive inferencing scheme ensures real-time motion +reconstruction that closely aligns with measured sensor signals. The generative +nature of DiffusionPoser ensures realistic behavior, even for +degrees-of-freedom not directly measured. Qualitative results can be found on +our website: https://diffusionposer.github.io/.",cs.CV,['cs.CV'] +Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology,Andrew Song · Richard J. Chen · Tong Ding · Drew F. K. Williamson · Guillaume Jaume · Faisal Mahmood, ,https://arxiv.org/abs/2405.11643,,2405.11643.pdf,Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology,"Representation learning of pathology whole-slide images (WSIs) has been has +primarily relied on weak supervision with Multiple Instance Learning (MIL). +However, the slide representations resulting from this approach are highly +tailored to specific clinical tasks, which limits their expressivity and +generalization, particularly in scenarios with limited data. Instead, we +hypothesize that morphological redundancy in tissue can be leveraged to build a +task-agnostic slide representation in an unsupervised fashion. To this end, we +introduce PANTHER, a prototype-based approach rooted in the Gaussian mixture +model that summarizes the set of WSI patches into a much smaller set of +morphological prototypes. Specifically, each patch is assumed to have been +generated from a mixture distribution, where each mixture component represents +a morphological exemplar. Utilizing the estimated mixture parameters, we then +construct a compact slide representation that can be readily used for a wide +range of downstream tasks. 
By performing an extensive evaluation of PANTHER on +subtyping and survival tasks using 13 datasets, we show that 1) PANTHER +outperforms or is on par with supervised MIL baselines and 2) the analysis of +morphological prototypes brings new qualitative and quantitative insights into +model interpretability.",cs.CV,"['cs.CV', 'cs.LG', 'stat.AP']" +FairRAG: Fair Human Generation via Fair Retrieval Augmentation,Robik Shrestha · Yang Zou · Qiuyu Chen · Zhiheng Li · Yusheng Xie · Siqi Deng, ,https://arxiv.org/abs/2403.19964,,2403.19964.pdf,FairRAG: Fair Human Generation via Fair Retrieval Augmentation,"Existing text-to-image generative models reflect or even amplify societal +biases ingrained in their training data. This is especially concerning for +human image generation where models are biased against certain demographic +groups. Existing attempts to rectify this issue are hindered by the inherent +limitations of the pre-trained models and fail to substantially improve +demographic diversity. In this work, we introduce Fair Retrieval Augmented +Generation (FairRAG), a novel framework that conditions pre-trained generative +models on reference images retrieved from an external image database to improve +fairness in human generation. FairRAG enables conditioning through a +lightweight linear module that projects reference images into the textual +space. To enhance fairness, FairRAG applies simple-yet-effective debiasing +strategies, providing images from diverse demographic groups during the +generative process. Extensive experiments demonstrate that FairRAG outperforms +existing methods in terms of demographic diversity, image-text alignment, and +image fidelity while incurring minimal computational overhead during inference.",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction,Guillaume Jaume · Anurag Vaidya · Richard J. Chen · Drew F. K. Williamson · Paul Pu Liang · Faisal Mahmood, ,https://arxiv.org/abs/2404.08027,,2404.08027.pdf,SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction,"Multi-modal learning that combines pathological images with genomic data has +significantly enhanced the accuracy of survival prediction. Nevertheless, +existing methods have not fully utilized the inherent hierarchical structure +within both whole slide images (WSIs) and transcriptomic data, from which +better intra-modal representations and inter-modal integration could be +derived. Moreover, many existing studies attempt to improve multi-modal +representations through attention mechanisms, which inevitably lead to high +complexity when processing high-dimensional WSIs and transcriptomic data. +Recently, a structured state space model named Mamba emerged as a promising +approach for its superior performance in modeling long sequences with low +complexity. In this study, we propose Mamba with multi-grained multi-modal +interaction (SurvMamba) for survival prediction. SurvMamba is implemented with +a Hierarchical Interaction Mamba (HIM) module that facilitates efficient +intra-modal interactions at different granularities, thereby capturing more +detailed local features as well as rich global representations. In addition, an +Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal +interactive fusion, yielding more comprehensive features for survival +prediction. 
Comprehensive evaluations on five TCGA datasets demonstrate that +SurvMamba outperforms other existing methods in terms of performance and +computational cost.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'q-bio.QM']" +End-to-End Spatio-Temporal Action Localisation with Video Transformers,Alexey Gritsenko · Xuehan Xiong · Josip Djolonga · Mostafa Dehghani · Chen Sun · Mario Lučić · Cordelia Schmid · Anurag Arnab, ,,https://openreview.net/forum?id=Va4t6R8cGG,,,,,nan +MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation,Haokai Zhu · Si-Yuan Cao · Jianxin Hu · Sitong Zuo · Beinan Yu · Jiacheng Ying · Junwei Li · Hui-Liang Shen,https://github.com/zjuzhk/MCNet,,https://www.youtube.com/watch?v=mcRa7BsZrOE,,,,,nan +SODA: Bottleneck Diffusion Models for Representation Learning,Drew Hudson · Daniel Zoran · Mateusz Malinowski · Andrew Lampinen · Andrew Jaegle · James McClelland · Loic Matthey · Felix Hill · Alexander Lerchner, ,https://arxiv.org/abs/2311.17901,,2311.17901.pdf,SODA: Bottleneck Diffusion Models for Representation Learning,"We introduce SODA, a self-supervised diffusion model, designed for +representation learning. The model incorporates an image encoder, which +distills a source view into a compact representation, that, in turn, guides the +generation of related novel views. We show that by imposing a tight bottleneck +between the encoder and a denoising decoder, and leveraging novel view +synthesis as a self-supervised objective, we can turn diffusion models into +strong representation learners, capable of capturing visual semantics in an +unsupervised manner. To the best of our knowledge, SODA is the first diffusion +model to succeed at ImageNet linear-probe classification, and, at the same +time, it accomplishes reconstruction, editing and synthesis tasks across a wide +range of datasets. Further investigation reveals the disentangled nature of its +emergent latent space, that serves as an effective interface to control and +manipulate the model's produced images. All in all, we aim to shed light on the +exciting and promising potential of diffusion models, not only for image +generation, but also for learning rich and robust representations.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +EasyDrag: Efficient Point-based Manipulation on Diffusion Models,Xingzhong Hou · Boxiao Liu · Yi Zhang · Jihao Liu · Yu Liu · Haihang You, ,,https://github.com/Yujun-Shi/DragDiffusion,,,,,nan +Segment and Caption Anything,Xiaoke Huang · Jianfeng Wang · Yansong Tang · Zheng Zhang · Han Hu · Jiwen Lu · Lijuan Wang · Zicheng Liu,https://xk-huang.github.io/segment-caption-anything/,https://arxiv.org/abs/2312.00869,,2312.00869.pdf,Segment and Caption Anything,"We propose a method to efficiently equip the Segment Anything Model (SAM) +with the ability to generate regional captions. SAM presents strong +generalizability to segment anything while is short for semantic understanding. +By introducing a lightweight query-based feature mixer, we align the +region-specific features with the embedding space of language models for later +caption generation. As the number of trainable parameters is small (typically +in the order of tens of millions), it costs less computation, less memory +usage, and less communication bandwidth, resulting in both fast and scalable +training. To address the scarcity problem of regional caption data, we propose +to first pre-train our model on objection detection and segmentation tasks. 
We +call this step weak supervision pretraining since the pre-training data only +contains category names instead of full-sentence descriptions. The weak +supervision pretraining allows us to leverage many publicly available object +detection and segmentation datasets. We conduct extensive experiments to +demonstrate the superiority of our method and validate each design choice. This +work serves as a stepping stone towards scaling up regional captioning data and +sheds light on exploring efficient ways to augment SAM with regional semantics. +The project page, along with the associated code, can be accessed via +https://xk-huang.github.io/segment-caption-anything/.",cs.CV,['cs.CV'] +6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,Li Xu · Haoxuan Qu · Yujun Cai · Jun Liu, ,https://arxiv.org/abs/2401.00029,,2401.00029.pdf,6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,"Estimating the 6D object pose from a single RGB image often involves noise +and indeterminacy due to challenges such as occlusions and cluttered +backgrounds. Meanwhile, diffusion models have shown appealing performance in +generating high-quality images from random noise with high indeterminacy +through step-by-step denoising. Inspired by their denoising capability, we +propose a novel diffusion-based framework (6D-Diff) to handle the noise and +indeterminacy in object pose estimation for better performance. In our +framework, to establish accurate 2D-3D correspondence, we formulate 2D +keypoints detection as a reverse diffusion (denoising) process. To facilitate +such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion +process and condition the reverse process on the object features. Extensive +experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our +framework.",cs.CV,['cs.CV'] +UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes,David Rozenberszki · Or Litany · Angela Dai,https://rozdavid.github.io/unscene3d,https://ar5iv.labs.arxiv.org/html/2312.11557,,2312.11557.pdf,SAI3D: Segment Any Instance in 3D Scenes,"Advancements in 3D instance segmentation have traditionally been tethered to +the availability of annotated datasets, limiting their application to a narrow +spectrum of object categories. Recent efforts have sought to harness +vision-language models like CLIP for open-set semantic reasoning, yet these +methods struggle to distinguish between objects of the same categories and rely +on specific prompts that are not universally applicable. In this paper, we +introduce SAI3D, a novel zero-shot 3D instance segmentation approach that +synergistically leverages geometric priors and semantic cues derived from +Segment Anything Model (SAM). Our method partitions a 3D scene into geometric +primitives, which are then progressively merged into 3D instance segmentations +that are consistent with the multi-view SAM masks. Moreover, we design a +hierarchical region-growing algorithm with a dynamic thresholding mechanism, +which largely improves the robustness of finegrained 3D scene parsing.Empirical +evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ +datasets demonstrate the superiority of our approach. Notably, SAI3D +outperforms existing open-vocabulary baselines and even surpasses +fully-supervised methods in class-agnostic segmentation on ScanNet++. 
Our +project page is at https://yd-yin.github.io/SAI3D.",cs.CV,['cs.CV'] +Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation,Yi Zhang · Meng-Hao Guo · Miao Wang · Shi-Min Hu, ,https://arxiv.org/abs/2403.08426,,2403.08426.pdf,Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation,"The pre-trained vision-language model, exemplified by CLIP, advances +zero-shot semantic segmentation by aligning visual features with class +embeddings through a transformer decoder to generate semantic masks. Despite +its effectiveness, prevailing methods within this paradigm encounter +challenges, including overfitting on seen classes and small fragmentation in +masks. To mitigate these issues, we propose a Language-Driven Visual Consensus +(LDVC) approach, fostering improved alignment of semantic and visual +information.Specifically, we leverage class embeddings as anchors due to their +discrete and abstract nature, steering vision features toward class embeddings. +Moreover, to circumvent noisy alignments from the vision part due to its +redundant nature, we introduce route attention into self-attention for finding +visual consensus, thereby enhancing semantic consistency within the same +object. Equipped with a vision-language prompting strategy, our approach +significantly boosts the generalization capacity of segmentation models for +unseen classes. Experimental results underscore the effectiveness of our +approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the +COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.AI']" +Selective nonlinearities removal from digital signals,Krzysztof Maliszewski · Magdalena Urbanska · Varvara Vetrova · Sylwia Kolenderska, ,https://arxiv.org/abs/2403.09731,,2403.09731.pdf,Selective nonlinearities removal from digital signals,"Many instruments performing optical and non-optical imaging and sensing, such +as Optical Coherence Tomography (OCT), Magnetic Resonance Imaging or +Fourier-transform spectrometry, produce digital signals containing modulations, +sine-like components, which only after Fourier transformation give information +about the structure or characteristics of the investigated object. Due to the +fundamental physics-related limitations of such methods, the distribution of +these signal components is often nonlinear and, when not properly compensated, +leads to the resolution, precision or quality drop in the final image. Here, we +propose an innovative approach that has the potential to allow cleaning of the +signal from the nonlinearities but most of all, it now allows to switch the +given order off, leaving all others intact. The latter provides a tool for more +in-depth analysis of the nonlinearity-inducing properties of the investigated +object, which can lead to applications in early disease detection or more +sensitive sensing of chemical compounds. We consider OCT signals and +nonlinearities up to the third order. In our approach, we propose two neural +networks: one to remove solely the second-order nonlinearity and the other for +removing solely the third-order nonlinearity. The input of the networks is a +novel two-dimensional data structure with all the information needed for the +network to infer a nonlinearity-free signal. 
We describe the developed networks +and present the results for second-order and third-order nonlinearity removal +in OCT data representing the images of various objects: a mirror, glass, and +fruits.",eess.IV,"['eess.IV', 'physics.data-an', 'physics.optics']" +Efficient Model Stealing Defense with Noise Transition Matrix,Dong-Dong Wu · Chilin Fu · Weichang Wu · Wenwen Xia · Xiaolu Zhang · JUN ZHOU · Min-Ling Zhang, ,https://arxiv.org/abs/2309.01838,,2309.01838.pdf,Efficient Defense Against Model Stealing Attacks on Convolutional Neural Networks,"Model stealing attacks have become a serious concern for deep learning +models, where an attacker can steal a trained model by querying its black-box +API. This can lead to intellectual property theft and other security and +privacy risks. The current state-of-the-art defenses against model stealing +attacks suggest adding perturbations to the prediction probabilities. However, +they suffer from heavy computations and make impracticable assumptions about +the adversary. They often require the training of auxiliary models. This can be +time-consuming and resource-intensive which hinders the deployment of these +defenses in real-world applications. In this paper, we propose a simple yet +effective and efficient defense alternative. We introduce a heuristic approach +to perturb the output probabilities. The proposed defense can be easily +integrated into models without additional training. We show that our defense is +effective in defending against three state-of-the-art stealing attacks. We +evaluate our approach on large and quantized (i.e., compressed) Convolutional +Neural Networks (CNNs) trained on several vision datasets. Our technique +outperforms the state-of-the-art defenses with a $\times37$ faster inference +latency without requiring any additional model and with a low impact on the +model's performance. We validate that our defense is also effective for +quantized CNNs targeting edge devices.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +Unsupervised Universal Image Segmentation,XuDong Wang · Dantong Niu · Xinyang Han · Long Lian · Roei Herzig · Trevor Darrell, ,https://arxiv.org/abs/2312.17243,,2312.17243.pdf,Unsupervised Universal Image Segmentation,"Several unsupervised image segmentation approaches have been proposed which +eliminate the need for dense manually-annotated segmentation masks; current +models separately handle either semantic segmentation (e.g., STEGO) or +class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., +panoptic segmentation). We propose an Unsupervised Universal Segmentation model +(U2Seg) adept at performing various image segmentation tasks -- instance, +semantic and panoptic -- using a novel unified framework. U2Seg generates +pseudo semantic labels for these segmentation tasks via leveraging +self-supervised models followed by clustering; each cluster represents +different semantic and/or instance membership of pixels. We then self-train the +model on these pseudo semantic labels, yielding substantial performance gains +over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost +vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc +increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. +Moreover, our method sets up a new baseline for unsupervised panoptic +segmentation, which has not been previously explored. 
U2Seg is also a strong +pretrained model for few-shot segmentation, surpassing CutLER by +5.0 +AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO +labels. We hope our simple yet effective method can inspire more research on +unsupervised universal image segmentation.",cs.CV,['cs.CV'] +HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation,Xin Huang · Ruizhi Shao · Qi Zhang · Hongwen Zhang · Ying Feng · Yebin Liu · Qing Wang,https://humannorm.github.io,https://arxiv.org/abs/2310.01406,,2310.01406.pdf,HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation,"Recent text-to-3D methods employing diffusion models have made significant +advancements in 3D human generation. However, these approaches face challenges +due to the limitations of text-to-image diffusion models, which lack an +understanding of 3D structures. Consequently, these methods struggle to achieve +high-quality human generation, resulting in smooth geometry and cartoon-like +appearances. In this paper, we propose HumanNorm, a novel approach for +high-quality and realistic 3D human generation. The main idea is to enhance the +model's 2D perception of 3D geometry by learning a normal-adapted diffusion +model and a normal-aligned diffusion model. The normal-adapted diffusion model +can generate high-fidelity normal maps corresponding to user prompts with +view-dependent and body-aware text. The normal-aligned diffusion model learns +to generate color images aligned with the normal maps, thereby transforming +physical geometry details into realistic appearance. Leveraging the proposed +normal diffusion model, we devise a progressive geometry generation strategy +and a multi-step Score Distillation Sampling (SDS) loss to enhance the +performance of 3D human generation. Comprehensive experiments substantiate +HumanNorm's ability to generate 3D humans with intricate geometry and realistic +appearances. HumanNorm outperforms existing text-to-3D methods in both geometry +and texture quality. The project page of HumanNorm is +https://humannorm.github.io/.",cs.CV,['cs.CV'] +SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation,Kejia Yin · Varshanth Rao · Ruowei Jiang · Xudong Liu · Parham Aarabi · David B. Lindell, ,https://arxiv.org/abs/2405.18322,,2405.18322.pdf,SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation,"Self-supervised landmark estimation is a challenging task that demands the +formation of locally distinct feature representations to identify sparse facial +landmarks in the absence of annotated data. To tackle this task, existing +state-of-the-art (SOTA) methods (1) extract coarse features from backbones that +are trained with instance-level self-supervised learning (SSL) paradigms, which +neglect the dense prediction nature of the task, (2) aggregate them into +memory-intensive hypercolumn formations, and (3) supervise lightweight +projector networks to naively establish full local correspondences among all +pairs of spatial features. 
In this paper, we introduce SCE-MAE, a framework +that (1) leverages the MAE, a region-level SSL method that naturally better +suits the landmark prediction task, (2) operates on the vanilla feature map +instead of on expensive hypercolumns, and (3) employs a Correspondence +Approximation and Refinement Block (CARB) that utilizes a simple density peak +clustering algorithm and our proposed Locality-Constrained Repellence Loss to +directly hone only select local correspondences. We demonstrate through +extensive experiments that SCE-MAE is highly effective and robust, +outperforming existing SOTA methods by large margins of approximately 20%-44% +on the landmark matching and approximately 9%-15% on the landmark detection +tasks.",cs.CV,"['cs.CV', 'cs.AI']" +Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples,Yuyang Yu · Bangzhen Liu · Chenxi Zheng · Xuemiao Xu · Huaidong Zhang · Shengfeng He,https://github.com/Yuyan9Yu/BeyondTextConstraint,https://arxiv.org/abs/2307.16424,,2307.16424.pdf,MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning,"Equipping a deep model the abaility of few-shot learning, i.e., learning +quickly from only few examples, is a core challenge for artificial +intelligence. Gradient-based meta-learning approaches effectively address the +challenge by learning how to learn novel tasks. Its key idea is learning a deep +model in a bi-level optimization manner, where the outer-loop process learns a +shared gradient descent algorithm (i.e., its hyperparameters), while the +inner-loop process leverage it to optimize a task-specific model by using only +few labeled data. Although these existing methods have shown superior +performance, the outer-loop process requires calculating second-order +derivatives along the inner optimization path, which imposes considerable +memory burdens and the risk of vanishing gradients. Drawing inspiration from +recent progress of diffusion models, we find that the inner-loop gradient +descent process can be actually viewed as a reverse process (i.e., denoising) +of diffusion where the target of denoising is model weights but the origin +data. Based on this fact, in this paper, we propose to model the gradient +descent optimizer as a diffusion model and then present a novel +task-conditional diffusion-based meta-learning, called MetaDiff, that +effectively models the optimization process of model weights from Gaussion +noises to target weights in a denoising manner. Thanks to the training +efficiency of diffusion models, our MetaDiff do not need to differentiate +through the inner-loop path such that the memory burdens and the risk of +vanishing gradients can be effectvely alleviated. Experiment results show that +our MetaDiff outperforms the state-of-the-art gradient-based meta-learning +family in few-shot learning tasks.",cs.LG,['cs.LG'] +Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints,Muxin Zhang · Qiao Feng · Zhuo Su · Chao Wen · Zhou Xue · Kun Li, ,https://arxiv.org/abs/2312.08591,,2312.08591.pdf,Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints,"3D human generation is increasingly significant in various applications. +However, the direct use of 2D generative methods in 3D generation often results +in losing local details, while methods that reconstruct geometry from generated +images struggle with global view consistency. 
In this work, we introduce +Joint2Human, a novel method that leverages 2D diffusion models to generate +detailed 3D human geometry directly, ensuring both global structure and local +details. To achieve this, we employ the Fourier occupancy field (FOF) +representation, enabling the direct generation of 3D shapes as preliminary +results with 2D generative models. With the proposed high-frequency enhancer +and the multi-view recarving strategy, our method can seamlessly integrate the +details from different views into a uniform global shape. To better utilize the +3D human prior and enhance control over the generated geometry, we introduce a +compact spherical embedding of 3D joints. This allows for an effective guidance +of pose during the generation process. Additionally, our method can generate 3D +humans guided by textual inputs. Our experimental results demonstrate the +capability of our method to ensure global structure, local details, high +resolution, and low computational cost simultaneously. More results and the +code can be found on our project page at +http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.",cs.CV,['cs.CV'] +Rethinking Generalizable Face Anti-spoofing via Hierarchical Prototype-guided Distribution Refinement in Hyperbolic Space,Chengyang Hu · Ke-Yue Zhang · Taiping Yao · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2308.09107,,2308.09107.pdf,Hyperbolic Face Anti-Spoofing,"Learning generalized face anti-spoofing (FAS) models against presentation +attacks is essential for the security of face recognition systems. Previous FAS +methods usually encourage models to extract discriminative features, of which +the distances within the same class (bonafide or attack) are pushed close while +those between bonafide and attack are pulled away. However, these methods are +designed based on Euclidean distance, which lacks generalization ability for +unseen attack detection due to poor hierarchy embedding ability. According to +the evidence that different spoofing attacks are intrinsically hierarchical, we +propose to learn richer hierarchical and discriminative spoofing cues in +hyperbolic space. Specifically, for unimodal FAS learning, the feature +embeddings are projected into the Poincar\'e ball, and then the hyperbolic +binary logistic regression layer is cascaded for classification. To further +improve generalization, we conduct hyperbolic contrastive learning for the +bonafide only while relaxing the constraints on diverse spoofing attacks. To +alleviate the vanishing gradient problem in hyperbolic space, a new feature +clipping method is proposed to enhance the training stability of hyperbolic +models. Besides, we further design a multimodal FAS framework with Euclidean +multimodal feature decomposition and hyperbolic multimodal feature fusion & +classification. Extensive experiments on three benchmark datasets (i.e., WMCA, +PADISI-Face, and SiW-M) with diverse attack types demonstrate that the proposed +method can bring significant improvement compared to the Euclidean baselines on +unseen attack detection. 
In addition, the proposed framework is also +generalized well on four benchmark datasets (i.e., MSU-MFSD, IDIAP +REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.",cs.CV,['cs.CV'] +NARUTO: Neural Active Reconstruction from Uncertain Target Observations,Ziyue Feng · Huangying Zhan · Zheng Chen · Qingan Yan · Xiangyu Xu · Changjiang Cai · Bing Li · Qilun Zhu · Yi Xu,https://oppo-us-research.github.io/NARUTO-website/,https://arxiv.org/abs/2402.18771v2,,2402.18771v2.pdf,NARUTO: Neural Active Reconstruction from Uncertain Target Observations,"We present NARUTO, a neural active reconstruction system that combines a +hybrid neural representation with uncertainty learning, enabling high-fidelity +surface reconstruction. Our approach leverages a multi-resolution hash-grid as +the mapping backbone, chosen for its exceptional convergence speed and capacity +to capture high-frequency local features.The centerpiece of our work is the +incorporation of an uncertainty learning module that dynamically quantifies +reconstruction uncertainty while actively reconstructing the environment. By +harnessing learned uncertainty, we propose a novel uncertainty aggregation +strategy for goal searching and efficient path planning. Our system +autonomously explores by targeting uncertain observations and reconstructs +environments with remarkable completeness and fidelity. We also demonstrate the +utility of this uncertainty-aware approach by enhancing SOTA neural SLAM +systems through an active ray sampling strategy. Extensive evaluations of +NARUTO in various environments, using an indoor scene simulator, confirm its +superior performance and state-of-the-art status in active reconstruction, as +evidenced by its impressive results on benchmark datasets like Replica and +MP3D.",cs.CV,"['cs.CV', 'cs.RO']" +CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning,Shiyu Tian · Hongxin Wei · Yiqun Wang · Lei Feng, ,,https://dblp.org/rec/journals/corr/abs-2303-10365,,,,,nan +Generative Proxemics: A Prior for 3D Social Interaction from Images,Vickie Ye · Vickie Ye · Georgios Pavlakos · Michael J. Black · Angjoo Kanazawa,https://muelea.github.io/buddi/,https://arxiv.org/abs/2306.09337,,2306.09337.pdf,Generative Proxemics: A Prior for 3D Social Interaction from Images,"Social interaction is a fundamental aspect of human behavior and +communication. The way individuals position themselves in relation to others, +also known as proxemics, conveys social cues and affects the dynamics of social +interaction. Reconstructing such interaction from images presents challenges +because of mutual occlusion and the limited availability of large training +datasets. To address this, we present a novel approach that learns a prior over +the 3D proxemics two people in close social interaction and demonstrate its use +for single-view 3D reconstruction. We start by creating 3D training data of +interacting people using image datasets with contact annotations. We then model +the proxemics using a novel denoising diffusion model called BUDDI that learns +the joint distribution over the poses of two people in close social +interaction. Sampling from our generative proxemics model produces realistic 3D +human interactions, which we validate through a perceptual study. We use BUDDI +in reconstructing two people in close proximity from a single image without any +contact annotation via an optimization approach that uses the diffusion model +as a prior. 
Our approach recovers accurate and plausible 3D social interactions +from noisy initial estimates, outperforming state-of-the-art methods. Our code, +data, and model are availableat our project website at: muelea.github.io/buddi.",cs.CV,['cs.CV'] +Learning Degradation Independent Representations for Camera ISP Pipelines,Yanhui Guo · Fangzhou Luo · Xiaolin Wu, ,https://arxiv.org/abs/2307.00761v3,,2307.00761v3.pdf,Learning Degradation-Independent Representations for Camera ISP Pipelines,"Image signal processing (ISP) pipeline plays a fundamental role in digital +cameras, which converts raw Bayer sensor data to RGB images. However, +ISP-generated images usually suffer from imperfections due to the compounded +degradations that stem from sensor noises, demosaicing noises, compression +artifacts, and possibly adverse effects of erroneous ISP hyperparameter +settings such as ISO and gamma values. In a general sense, these ISP +imperfections can be considered as degradations. The highly complex mechanisms +of ISP degradations, some of which are even unknown, pose great challenges to +the generalization capability of deep neural networks (DNN) for image +restoration and to their adaptability to downstream tasks. To tackle the +issues, we propose a novel DNN approach to learn degradation-independent +representations (DiR) through the refinement of a self-supervised learned +baseline representation. The proposed DiR learning technique has remarkable +domain generalization capability and consequently, it outperforms +state-of-the-art methods across various downstream tasks, including blind image +restoration, object detection, and instance segmentation, as verified in our +experiments.",cs.CV,['cs.CV'] +VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation,XuDong Wang · Ishan Misra · Ziyun Zeng · Rohit Girdhar · Trevor Darrell, ,https://arxiv.org/abs/2308.14710,,2308.14710.pdf,VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation,"Existing approaches to unsupervised video instance segmentation typically +rely on motion estimates and experience difficulties tracking small or +divergent motions. We present VideoCutLER, a simple method for unsupervised +multi-instance video segmentation without using motion-based learning signals +like optical flow or training on natural videos. Our key insight is that using +high-quality pseudo masks and a simple video synthesis method for model +training is surprisingly sufficient to enable the resulting video model to +effectively segment and track multiple instances across video frames. We show +the first competitive unsupervised learning results on the challenging +YouTubeVIS-2019 benchmark, achieving 50.7% APvideo^50 , surpassing the previous +state-of-the-art by a large margin. VideoCutLER can also serve as a strong +pretrained model for supervised video instance segmentation tasks, exceeding +DINO by 15.9% on YouTubeVIS-2019 in terms of APvideo.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching,Xinghui Li · Jingyi Lu · Kai Han · Victor Adrian Prisacariu, ,https://arxiv.org/abs/2310.17569,,2310.17569.pdf,SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching,"In this paper, we address the challenge of matching semantically similar +keypoints across image pairs. Existing research indicates that the intermediate +output of the UNet within the Stable Diffusion (SD) can serve as robust image +feature maps for such a matching task. 
We demonstrate that by employing a basic +prompt tuning technique, the inherent potential of Stable Diffusion can be +harnessed, resulting in a significant enhancement in accuracy over previous +approaches. We further introduce a novel conditional prompting module that +conditions the prompt on the local details of the input image pairs, leading to +a further improvement in performance. We designate our approach as SD4Match, +short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of +SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets +new benchmarks in accuracy across all these datasets. Particularly, SD4Match +outperforms the previous state-of-the-art by a margin of 12 percentage points +on the challenging SPair-71k dataset.",cs.CV,"['cs.CV', 'cs.LG']" +PoNQ: a Neural QEM-based Mesh Representation,Nissim Maruani · Maks Ovsjanikov · Pierre Alliez · Mathieu Desbrun,https://nissmar.github.io/projects/ponq/,https://arxiv.org/abs/2403.12870,,2403.12870.pdf,PoNQ: a Neural QEM-based Mesh Representation,"Although polygon meshes have been a standard representation in geometry +processing, their irregular and combinatorial nature hinders their suitability +for learning-based applications. In this work, we introduce a novel learnable +mesh representation through a set of local 3D sample Points and their +associated Normals and Quadric error metrics (QEM) w.r.t. the underlying shape, +which we denote PoNQ. A global mesh is directly derived from PoNQ by +efficiently leveraging the knowledge of the local quadric errors. Besides +marking the first use of QEM within a neural shape representation, our +contribution guarantees both topological and geometrical properties by ensuring +that a PoNQ mesh does not self-intersect and is always the boundary of a +volume. Notably, our representation does not rely on a regular grid, is +supervised directly by the target surface alone, and also handles open surfaces +with boundaries and/or sharp features. We demonstrate the efficacy of PoNQ +through a learning-based mesh prediction from SDF grids and show that our +method surpasses recent state-of-the-art techniques in terms of both surface +and edge-based metrics.",cs.CV,['cs.CV'] +M&M VTO: Multi-Garment Virtual Try-On and Editing,Luyang Zhu · Yingwei Li · Nan Liu · Hao Peng · Dawei Yang · Ira Kemelmacher-Shlizerman,https://mmvto.github.io/,https://arxiv.org/abs/2405.07472,,2405.07472.pdf,GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting,"The increasing prominence of e-commerce has underscored the importance of +Virtual Try-On (VTON). However, previous studies predominantly focus on the 2D +realm and rely heavily on extensive data for training. Research on 3D VTON +primarily centers on garment-body shape compatibility, a topic extensively +covered in 2D VTON. Thanks to advances in 3D scene editing, a 2D diffusion +model has now been adapted for 3D editing via multi-viewpoint editing. In this +work, we propose GaussianVTON, an innovative 3D VTON pipeline integrating +Gaussian Splatting (GS) editing with 2D VTON. To facilitate a seamless +transition from 2D to 3D VTON, we propose, for the first time, the use of only +images as editing prompts for 3D editing. To further address issues, e.g., face +blurring, garment inaccuracy, and degraded viewpoint quality during editing, we +devise a three-stage refinement strategy to gradually mitigate potential +issues. 
Furthermore, we introduce a new editing strategy termed Edit Recall +Reconstruction (ERR) to tackle the limitations of previous editing strategies +in leading to complex geometric changes. Our comprehensive experiments +demonstrate the superiority of GaussianVTON, offering a novel perspective on 3D +VTON while also establishing a novel starting point for image-prompting 3D +scene editing.",cs.CV,['cs.CV'] +One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls,Minghui Hu · Jianbin Zheng · Chuanxia Zheng · Chaoyue Wang · Dacheng Tao · Tat-Jen Cham, ,https://arxiv.org/abs/2311.15744,,2311.15744.pdf,One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls,"It is well known that many open-released foundational diffusion models have +difficulty in generating images that substantially depart from average +brightness, despite such images being present in the training data. This is due +to an inconsistency: while denoising starts from pure Gaussian noise during +inference, the training noise schedule retains residual data even in the final +timestep distribution, due to difficulties in numerical conditioning in +mainstream formulation, leading to unintended bias during inference. To +mitigate this issue, certain $\epsilon$-prediction models are combined with an +ad-hoc offset-noise methodology. In parallel, some contemporary models have +adopted zero-terminal SNR noise schedules together with +$\mathbf{v}$-prediction, which necessitate major alterations to pre-trained +models. However, such changes risk destabilizing a large multitude of +community-driven applications anchored on these pre-trained models. In light of +this, our investigation revisits the fundamental causes, leading to our +proposal of an innovative and principled remedy, called One More Step (OMS). By +integrating a compact network and incorporating an additional simple yet +effective step during inference, OMS elevates image fidelity and harmonizes the +dichotomy between training and inference, while preserving original model +parameters. Once trained, various pre-trained diffusion models with the same +latent domain can share the same OMS module.",cs.CV,['cs.CV'] +Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features,Thomas Wimmer · Peter Wonka · Maks Ovsjanikov,https://wimmerth.github.io/back-to-3d.html,https://arxiv.org/abs/2311.18113,,2311.18113.pdf,Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features,"With the immense growth of dataset sizes and computing resources in recent +years, so-called foundation models have become popular in NLP and vision tasks. +In this work, we propose to explore foundation models for the task of keypoint +detection on 3D shapes. A unique characteristic of keypoint detection is that +it requires semantic and geometric awareness while demanding high localization +accuracy. To address this problem, we propose, first, to back-project features +from large pre-trained 2D vision models onto 3D shapes and employ them for this +task. We show that we obtain robust 3D features that contain rich semantic +information and analyze multiple candidate features stemming from different 2D +foundation models. Second, we employ a keypoint candidate optimization module +which aims to match the average observed distribution of keypoints on the shape +and is guided by the back-projected features. 
The resulting approach achieves a +new state of the art for few-shot keypoint detection on the KeyPointNet +dataset, almost doubling the performance of the previous best methods.",cs.CV,"['cs.CV', 'cs.GR']" +Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining,Xiang Chen · Jinshan Pan · Jiangxin Dong,https://github.com/cschenxiang/NeRD-Rain,https://arxiv.org/abs/2404.01547v1,,2404.01547v1.pdf,Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining,"How to effectively explore multi-scale representations of rain streaks is +important for image deraining. In contrast to existing Transformer-based +methods that depend mostly on single-scale rain appearance, we develop an +end-to-end multi-scale Transformer that leverages the potentially useful +features in various scales to facilitate high-quality image reconstruction. To +better explore the common degradation representations from spatially-varying +rain streaks, we incorporate intra-scale implicit neural representations based +on pixel coordinates with the degraded inputs in a closed-loop design, enabling +the learned features to facilitate rain removal and improve the robustness of +the model in complex scenarios. To ensure richer collaborative representation +from different scales, we embed a simple yet effective inter-scale +bidirectional feedback operation into our multi-scale Transformer by performing +coarse-to-fine and fine-to-coarse information communication. Extensive +experiments demonstrate that our approach, named as NeRD-Rain, performs +favorably against the state-of-the-art ones on both synthetic and real-world +benchmark datasets. The source code and trained models are available at +https://github.com/cschenxiang/NeRD-Rain.",cs.CV,['cs.CV'] +InstanceDiffusion: Instance-level Control for Image Generation,XuDong Wang · Trevor Darrell · Sai Saketh Rambhatla · Rohit Girdhar · Ishan Misra, ,https://arxiv.org/abs/2402.03290,,2402.03290.pdf,InstanceDiffusion: Instance-level Control for Image Generation,"Text-to-image diffusion models produce high quality images but do not offer +control over individual instances in the image. We introduce InstanceDiffusion +that adds precise instance-level control to text-to-image diffusion models. +InstanceDiffusion supports free-form language conditions per instance and +allows flexible ways to specify instance locations such as simple single +points, scribbles, bounding boxes or intricate instance segmentation masks, and +combinations thereof. We propose three major changes to text-to-image models +that enable precise instance-level control. Our UniFusion block enables +instance-level conditions for text-to-image models, the ScaleU block improves +image fidelity, and our Multi-instance Sampler improves generations for +multiple instances. InstanceDiffusion significantly surpasses specialized +state-of-the-art models for each location condition. Notably, on the COCO +dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^\text{box}$ +for box inputs, and 25.4% IoU for mask inputs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation,Qinghe Ma · Jian Zhang · Lei Qi · Qian Yu · Yinghuan Shi · Yang Gao, ,https://arxiv.org/abs/2404.08951,,2404.08951.pdf,Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation,"Both limited annotation and domain shift are prevalent challenges in medical +image segmentation. 
Traditional semi-supervised segmentation and unsupervised +domain adaptation methods address one of these issues separately. However, the +coexistence of limited annotation and domain shift is quite common, which +motivates us to introduce a novel and challenging scenario: Mixed Domain +Semi-supervised medical image Segmentation (MiDSS). In this scenario, we handle +data from multiple medical centers, with limited annotations available for a +single domain and a large amount of unlabeled data from multiple domains. We +found that the key to solving the problem lies in how to generate reliable +pseudo labels for the unlabeled data in the presence of domain shift with +labeled data. To tackle this issue, we employ Unified Copy-Paste (UCP) between +images to construct intermediate domains, facilitating the knowledge transfer +from the domain of labeled data to the domains of unlabeled data. To fully +utilize the information within the intermediate domain, we propose a symmetric +Guidance training strategy (SymGD), which additionally offers direct guidance +to unlabeled data by merging pseudo labels from intermediate samples. +Subsequently, we introduce a Training Process aware Random Amplitude MixUp +(TP-RAM) to progressively incorporate style-transition components into +intermediate samples. Compared with existing state-of-the-art approaches, our +method achieves a notable 13.57% improvement in Dice score on Prostate dataset, +as demonstrated on three public datasets. Our code is available at +https://github.com/MQinghe/MiDSS .",cs.CV,"['cs.CV', 'cs.LG']" +NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,Yannan He · Garvita Tiwari · Tolga Birdal · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2403.03122v1,,2403.03122v1.pdf,NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,"Faithfully modeling the space of articulations is a crucial task that allows +recovery and generation of realistic poses, and remains a notorious challenge. +To this end, we introduce Neural Riemannian Distance Fields (NRDFs), +data-driven priors modeling the space of plausible articulations, represented +as the zero-level-set of a neural field in a high-dimensional +product-quaternion space. To train NRDFs only on positive examples, we +introduce a new sampling algorithm, ensuring that the geodesic distances follow +a desired distribution, yielding a principled distance field learning paradigm. +We then devise a projection algorithm to map any random pose onto the level-set +by an adaptive-step Riemannian optimizer, adhering to the product manifold of +joint rotations at all times. NRDFs can compute the Riemannian gradient via +backpropagation and by mathematical analogy, are related to Riemannian flow +matching, a recent generative model. We conduct a comprehensive evaluation of +NRDF against other pose priors in various downstream tasks, i.e., pose +generation, image-based pose estimation, and solving inverse kinematics, +highlighting NRDF's superior performance. 
Besides humans, NRDF's versatility +extends to hand and animal poses, as it can effectively represent any +articulation.",cs.CV,['cs.CV'] +Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,Yuxi Mi · Zhizhou Zhong · Yuge Huang · Jiazhen Ji · Jianqing Xu · Jun Wang · ShaoMing Wang · Shouhong Ding · Shuigeng Zhou,https://github.com/Tencent/TFace/tree/master/recognition/tasks/minusface,https://arxiv.org/abs/2403.12457,,,Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,"The widespread adoption of face recognition has led to increasing privacy +concerns, as unauthorized access to face images can expose sensitive personal +information. This paper explores face image protection against viewing and +recovery attacks. Inspired by image compression, we propose creating a visually +uninformative face image through feature subtraction between an original face +and its model-produced regeneration. Recognizable identity features within the +image are encouraged by co-training a recognition model on its high-dimensional +feature representation. To enhance privacy, the high-dimensional representation +is crafted through random channel shuffling, resulting in randomized +recognizable images devoid of attacker-leverageable texture details. We distill +our methodologies into a novel privacy-preserving face recognition method, +MinusFace. Experiments demonstrate its high recognition accuracy and effective +privacy protection. Its code is available at https://github.com/Tencent/TFace.",cs.CV,['cs.CV'] +Generating Human Motion in 3D Scenes from Text Descriptions,Zhi Cen · Huaijin Pi · Sida Peng · Zehong Shen · Minghui Yang · Shuai Zhu · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/text_scene_motion/,https://arxiv.org/html/2405.07784v1,,2405.07784v1.pdf,Generating Human Motion in 3D Scenes from Text Descriptions,"Generating human motions from textual descriptions has gained growing +research interest due to its wide range of applications. However, only a few +works consider human-scene interactions together with text conditions, which is +crucial for visual and physical realism. This paper focuses on the task of +generating human motions in 3D indoor scenes given text descriptions of the +human-scene interactions. This task presents challenges due to the +multi-modality nature of text, scene, and motion, as well as the need for +spatial reasoning. To address these challenges, we propose a new approach that +decomposes the complex problem into two more manageable sub-problems: (1) +language grounding of the target object and (2) object-centric motion +generation. For language grounding of the target object, we leverage the power +of large language models. For motion generation, we design an object-centric +scene representation for the generative model to focus on the target object, +thereby reducing the scene complexity and facilitating the modeling of the +relationship between human motions and the object. 
Experiments demonstrate the +better motion quality of our approach compared to baselines and validate our +design choices.",cs.CV,['cs.CV'] +HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting,Xian Liu · Xiaohang Zhan · Jiaxiang Tang · Ying Shan · Gang Zeng · Dahua Lin · Xihui Liu · Ziwei Liu,https://alvinliu0.github.io/projects/HumanGaussian,https://arxiv.org/abs/2311.17061,,2311.17061.pdf,HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting,"Realistic 3D human generation from text prompts is a desirable yet +challenging task. Existing methods optimize 3D representations like mesh or +neural fields via score distillation sampling (SDS), which suffers from +inadequate fine details or excessive training time. In this paper, we propose +an efficient yet effective framework, HumanGaussian, that generates +high-quality 3D humans with fine-grained geometry and realistic appearance. Our +key insight is that 3D Gaussian Splatting is an efficient renderer with +periodic Gaussian shrinkage or growing, where such adaptive density control can +be naturally guided by intrinsic human structures. Specifically, 1) we first +propose a Structure-Aware SDS that simultaneously optimizes human appearance +and geometry. The multi-modal score function from both RGB and depth space is +leveraged to distill the Gaussian densification and pruning process. 2) +Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS +into a noisier generative score and a cleaner classifier score, which well +addresses the over-saturation issue. The floating artifacts are further +eliminated based on Gaussian size in a prune-only phase to enhance generation +smoothness. Extensive experiments demonstrate the superior efficiency and +competitive quality of our framework, rendering vivid 3D humans under diverse +scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian",cs.CV,['cs.CV'] +"See, Say, and Segment: Correcting False Premises with LMMs",Tsung-Han Wu · Giscard Biamby · David Chan · Lisa Dunlap · Ritwik Gupta · XuDong Wang · Trevor Darrell · Joseph Gonzalez,https://see-say-segment.github.io/,https://arxiv.org/html/2312.08366v1,,2312.08366v1.pdf,"See, Say, and Segment: Teaching LMMs to Overcome False Premises","Current open-source Large Multimodal Models (LMMs) excel at tasks such as +open-vocabulary language grounding and segmentation but can suffer under false +premises when queries imply the existence of something that is not actually +present in the image. We observe that existing methods that fine-tune an LMM to +segment images significantly degrade their ability to reliably determine +(""see"") if an object is present and to interact naturally with humans (""say""), +a form of catastrophic forgetting. In this work, we propose a cascading and +joint training approach for LMMs to solve this task, avoiding catastrophic +forgetting of previous skills. Our resulting model can ""see"" by detecting +whether objects are present in an image, ""say"" by telling the user if they are +not, proposing alternative queries or correcting semantic errors in the query, +and finally ""segment"" by outputting the mask of the desired objects if they +exist. Additionally, we introduce a novel False Premise Correction benchmark +dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets +(which we call FP-RefCOCO(+/g)). 
The results show that our method not only +detects false premises up to 55% better than existing approaches, but under +false premise conditions produces relative cIOU improvements of more than 31% +over baselines, and produces natural language feedback judged helpful up to 67% +of the time.",cs.CV,['cs.CV'] +Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios,Jie Xu · Yazhou Ren · Xiaolong Wang · Lei Feng · Zheng Zhang · Gang Niu · Xiaofeng Zhu,https://github.com/SubmissionsIn/MVCAN,,https://submissionsin.github.io/,,,,,nan +Learned representation-guided diffusion models for large-image generation,Alexandros Graikos · Srikar Yellapragada · Minh-Quan Le · Saarthak Kapse · Prateek Prasanna · Joel Saltz · Dimitris Samaras,https://histodiffusion.github.io/docs/publications/cvpr_24,https://arxiv.org/abs/2312.07330,,2312.07330.pdf,Learned representation-guided diffusion models for large-image generation,"To synthesize high-fidelity samples, diffusion models typically require +auxiliary data to guide the generation process. However, it is impractical to +procure the painstaking patch-level annotation effort required in specialized +domains like histopathology and satellite imagery; it is often performed by +domain experts and involves hundreds of millions of patches. Modern-day +self-supervised learning (SSL) representations encode rich semantic and visual +information. In this paper, we posit that such representations are expressive +enough to act as proxies to fine-grained human labels. We introduce a novel +approach that trains diffusion models conditioned on embeddings from SSL. Our +diffusion models successfully project these features back to high-quality +histopathology and remote sensing images. In addition, we construct larger +images by assembling spatially consistent patches inferred from SSL embeddings, +preserving long-range dependencies. Augmenting real data by generating +variations of real images improves downstream classifier accuracy for +patch-level and larger, image-scale classification tasks. Our models are +effective even on datasets not encountered during training, demonstrating their +robustness and generalizability. Generating images from learned embeddings is +agnostic to the source of the embeddings. The SSL embeddings used to generate a +large image can either be extracted from a reference image, or sampled from an +auxiliary model conditioned on any related modality (e.g. class labels, text, +genomic data). As proof of concept, we introduce the text-to-large image +synthesis paradigm where we successfully synthesize large pathology and +satellite images out of text descriptions.",cs.CV,['cs.CV'] +PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis,Zhengyao Lv · Yuxiang Wei · Wangmeng Zuo · Kwan-Yee K. Wong, ,https://arxiv.org/abs/2403.01852,,2403.01852.pdf,PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis,"Recent advancements in large-scale pre-trained text-to-image models have led +to remarkable progress in semantic image synthesis. Nevertheless, synthesizing +high-quality images with consistent semantics and layout remains a challenge. +In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) +that harnesses pre-trained models to alleviate the aforementioned issues. +Specifically, we first employ the layout control map to faithfully represent +layouts in the feature space. 
Subsequently, we combine the layout and semantic +features in a timestep-adaptive manner to synthesize images with realistic +details. During fine-tuning, we propose the Semantic Alignment (SA) loss to +further enhance layout alignment. Additionally, we introduce the Layout-Free +Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the +priors of pre-trained models, thereby improving the visual quality and semantic +consistency of synthesized images. Extensive experiments demonstrate that our +approach performs favorably in terms of visual quality, semantic consistency, +and layout alignment. The source code and model are available at +https://github.com/cszy98/PLACE/tree/main.",cs.CV,['cs.CV'] +Regressor-Segmenter Mutual Prompt Learning for Crowd Counting,Mingyue Guo · Li Yuan · Zhaoyi Yan · Binghui Chen · Yaowei Wang · Qixiang Ye, ,https://arxiv.org/abs/2312.01711v2,,2312.01711v2.pdf,Regressor-Segmenter Mutual Prompt Learning for Crowd Counting,"Crowd counting has achieved significant progress by training regressors to +predict instance positions. In heavily crowded scenarios, however, regressors +are challenged by uncontrollable annotation variance, which causes density map +bias and context information inaccuracy. In this study, we propose mutual +prompt learning (mPrompt), which leverages a regressor and a segmenter as +guidance for each other, solving bias and inaccuracy caused by annotation +variance while distinguishing foreground from background. In specific, mPrompt +leverages point annotations to tune the segmenter and predict pseudo head masks +in a way of point prompt learning. It then uses the predicted segmentation +masks, which serve as spatial constraint, to rectify biased point annotations +as context prompt learning. mPrompt defines a way of mutual information +maximization from prompt learning, mitigating the impact of annotation variance +while improving model accuracy. Experiments show that mPrompt significantly +reduces the Mean Average Error (MAE), demonstrating the potential to be general +framework for down-stream vision tasks.",cs.CV,['cs.CV'] +SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,Zhijing Shao · Wang Zhaolong · Zhuang Li · Duotun Wang · Xiangru Lin · Yu Zhang · Mingming Fan · Zeyu Wang,https://initialneil.github.io/SplattingAvatar,https://arxiv.org/abs/2403.05087,,2403.05087.pdf,SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,"We present SplattingAvatar, a hybrid 3D representation of photorealistic +human avatars with Gaussian Splatting embedded on a triangle mesh, which +renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We +disentangle the motion and appearance of a virtual human with explicit mesh +geometry and implicit appearance modeling with Gaussian Splatting. The +Gaussians are defined by barycentric coordinates and displacement on a triangle +mesh as Phong surfaces. We extend lifted optimization to simultaneously +optimize the parameters of the Gaussians while walking on the triangle mesh. +SplattingAvatar is a hybrid representation of virtual humans where the mesh +represents low-frequency motion and surface deformation, while the Gaussians +take over the high-frequency geometry and detailed appearance. 
Unlike existing +deformation methods that rely on an MLP-based linear blend skinning (LBS) field +for motion, we control the rotation and translation of the Gaussians directly +by mesh, which empowers its compatibility with various animation techniques, +e.g., skeletal animation, blend shapes, and mesh editing. Trainable from +monocular videos for both full-body and head avatars, SplattingAvatar shows +state-of-the-art rendering quality across multiple datasets.",cs.GR,"['cs.GR', 'cs.CV']" +FastMAC: Stochastic Spectral Sampling of Correspondence Graph,Yifei Zhang · Hao Zhao · Hongyang Li · Siheng Chen,https://github.com/Forrest-110/FastMAC,https://arxiv.org/abs/2403.08770,,2403.08770.pdf,FastMAC: Stochastic Spectral Sampling of Correspondence Graph,"3D correspondence, i.e., a pair of 3D points, is a fundamental concept in +computer vision. A set of 3D correspondences, when equipped with compatibility +edges, forms a correspondence graph. This graph is a critical component in +several state-of-the-art 3D point cloud registration approaches, e.g., the one +based on maximal cliques (MAC). However, its properties have not been well +understood. So we present the first study that introduces graph signal +processing into the domain of correspondence graph. We exploit the generalized +degree signal on correspondence graph and pursue sampling strategies that +preserve high-frequency components of this signal. To address time-consuming +singular value decomposition in deterministic sampling, we resort to a +stochastic approximate sampling strategy. As such, the core of our method is +the stochastic spectral sampling of correspondence graph. As an application, we +build a complete 3D registration algorithm termed as FastMAC, that reaches +real-time speed while leading to little to none performance drop. Through +extensive experiments, we validate that FastMAC works for both indoor and +outdoor benchmarks. For example, FastMAC can accelerate MAC by 80 times while +maintaining high registration success rate on KITTI. Codes are publicly +available at https://github.com/Forrest-110/FastMAC.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Fairy: Fast Parallellized Instruction-Guided Video-to-Video Synthesis,Bichen Wu · Ching-Yao Chuang · Xiaoyan Wang · Yichen Jia · Kapil Krishnakumar · Tong Xiao · Feng Liang · Licheng Yu · Peter Vajda, ,https://arxiv.org/abs/2312.13834,,2312.13834.pdf,Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis,"In this paper, we introduce Fairy, a minimalist yet robust adaptation of +image-editing diffusion models, enhancing them for video editing applications. +Our approach centers on the concept of anchor-based cross-frame attention, a +mechanism that implicitly propagates diffusion features across frames, ensuring +superior temporal coherence and high-fidelity synthesis. Fairy not only +addresses limitations of previous models, including memory and processing +speed. It also improves temporal consistency through a unique data augmentation +strategy. This strategy renders the model equivariant to affine transformations +in both source and target images. Remarkably efficient, Fairy generates +120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, +outpacing prior works by at least 44x. 
A comprehensive user study, involving +1000 generated samples, confirms that our approach delivers superior quality, +decisively outperforming established methods.",cs.CV,['cs.CV'] +MMA: Multi-Modal Adapter for Vision-Language Models,Lingxiao Yang · Ru-Yuan Zhang · Yanchen Wang · Xiaohua Xie, ,https://arxiv.org/abs/2405.15684,,2405.15684.pdf,Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models,"To bridge the gap between vision and language modalities, Multimodal Large +Language Models (MLLMs) usually learn an adapter that converts visual inputs to +understandable tokens for Large Language Models (LLMs). However, most adapters +generate consistent visual tokens, regardless of the specific objects of +interest mentioned in the prompt. Since these adapters distribute equal +attention to every detail in the image and focus on the entire scene, they may +increase the cognitive load for LLMs, particularly when processing complex +scenes. To alleviate this problem, we propose prompt-aware adapters. These +adapters are designed with the capability to dynamically embed visual inputs +based on the specific focus of the prompt. Specifically, prompt-aware adapters +utilize both global and local textual features to capture the most relevant +visual clues from the prompt at both coarse and fine granularity levels. This +approach significantly enhances the ability of LLMs to understand and interpret +visual content. Experiments on various visual question answering tasks, such as +counting and position reasoning, demonstrate the effectiveness of prompt-aware +adapters.",cs.CV,"['cs.CV', 'cs.AI']" +RoDLA: Benchmarking the Robustness of Document Layout Analysis Models,Yufan Chen · Jiaming Zhang · Kunyu Peng · Junwei Zheng · Ruiping Liu · Philip H.S. Torr · Rainer Stiefelhagen,https://yufanchen96.github.io/projects/RoDLA/,https://arxiv.org/abs/2403.14442,,2403.14442.pdf,RoDLA: Benchmarking the Robustness of Document Layout Analysis Models,"Before developing a Document Layout Analysis (DLA) model in real-world +applications, conducting comprehensive robustness testing is essential. +However, the robustness of DLA models remains underexplored in the literature. +To address this, we are the first to introduce a robustness benchmark for DLA +models, which includes 450K document images of three datasets. To cover +realistic corruptions, we propose a perturbation taxonomy with 36 common +document perturbations inspired by real-world document processing. +Additionally, to better understand document perturbation impacts, we propose +two metrics, Mean Perturbation Effect (mPE) for perturbation assessment and +Mean Robustness Degradation (mRD) for robustness evaluation. Furthermore, we +introduce a self-titled model, i.e., Robust Document Layout Analyzer (RoDLA), +which improves attention mechanisms to boost extraction of robust features. +Experiments on the proposed benchmarks (PubLayNet-P, DocLayNet-P, and +M$^6$Doc-P) demonstrate that RoDLA obtains state-of-the-art mRD scores of +115.7, 135.4, and 150.4, respectively. 
Compared to previous methods, RoDLA +achieves notable improvements in mAP of +3.8%, +7.1% and +12.1%, respectively.",cs.CV,['cs.CV'] +LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion,Pancheng Zhao · Peng Xu · Pengda Qin · Deng-Ping Fan · Zhicheng Zhang · Guoli Jia · Bowen Zhou · Jufeng Yang, ,https://arxiv.org/abs/2404.00292,,2404.00292.pdf,LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion,"Camouflaged vision perception is an important vision task with numerous +practical applications. Due to the expensive collection and labeling costs, +this community struggles with a major bottleneck that the species category of +its datasets is limited to a small number of object species. However, the +existing camouflaged generation methods require specifying the background +manually, thus failing to extend the camouflaged sample diversity in a low-cost +manner. In this paper, we propose a Latent Background Knowledge +Retrieval-Augmented Diffusion (LAKE-RED) for camouflaged image generation. To +our knowledge, our contributions mainly include: (1) For the first time, we +propose a camouflaged generation paradigm that does not need to receive any +background inputs. (2) Our LAKE-RED is the first knowledge retrieval-augmented +method with interpretability for camouflaged generation, in which we propose an +idea that knowledge retrieval and reasoning enhancement are separated +explicitly, to alleviate the task-specific challenges. Moreover, our method is +not restricted to specific foreground targets or backgrounds, offering a +potential for extending camouflaged vision perception to more diverse domains. +(3) Experimental results demonstrate that our method outperforms the existing +approaches, generating more realistic camouflage images.",cs.CV,['cs.CV'] +Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding,Peng Jin · Ryuichi Takanobu · Cai Zhang · Xiaochun Cao · Li Yuan,https://github.com/PKU-YuanGroup/Chat-UniVi,https://arxiv.org/abs/2311.08046,,2311.08046.pdf,Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding,"Large language models have demonstrated impressive universal capabilities +across a wide range of open-ended tasks and have extended their utility to +encompass multimodal conversations. However, existing methods encounter +challenges in effectively handling both image and video understanding, +particularly with limited visual tokens. In this work, we introduce Chat-UniVi, +a Unified Vision-language model capable of comprehending and engaging in +conversations involving images and videos through a unified visual +representation. Specifically, we employ a set of dynamic visual tokens to +uniformly represent images and videos. This representation framework empowers +the model to efficiently utilize a limited number of visual tokens to +simultaneously capture the spatial details necessary for images and the +comprehensive temporal relationship required for videos. Moreover, we leverage +a multi-scale representation, enabling the model to perceive both high-level +semantic concepts and low-level visual details. Notably, Chat-UniVi is trained +on a mixed dataset containing both images and videos, allowing direct +application to tasks involving both mediums without requiring any +modifications. 
Extensive experimental results demonstrate that Chat-UniVi +consistently outperforms even existing methods exclusively designed for either +images or videos. Code is available at +https://github.com/PKU-YuanGroup/Chat-UniVi.",cs.CV,['cs.CV'] +HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images,Xihe Yang · Xingyu Chen · Daiheng Gao · Finn Wong · Xiaoguang Han · Baoyuan Wang, ,https://arxiv.org/abs/2311.15672,,2311.15672.pdf,HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images,"As for human avatar reconstruction, contemporary techniques commonly +necessitate the acquisition of costly data and struggle to achieve satisfactory +results from a small number of casual images. In this paper, we investigate +this task from a few-shot unconstrained photo album. The reconstruction of +human avatars from such data sources is challenging because of limited data +amount and dynamic articulated poses. For handling dynamic data, we integrate a +skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable +tetrahedral representation, which drives arbitrary mesh topologies generated by +the DMTet for the adaptation of unconstrained images. To effectively mine +instructive information from few-shot data, we devise a two-phase optimization +method with few-shot reference and few-shot guidance. The former focuses on +aligning avatar identity with reference images, while the latter aims to +generate plausible appearances for unseen regions. Overall, our framework, +called HaveFun, can undertake avatar reconstruction, rendering, and animation. +Extensive experiments on our developed benchmarks demonstrate that HaveFun +exhibits substantially superior performance in reconstructing the human body +and hand. Project website: https://seanchenxy.github.io/HaveFunWeb/.",cs.CV,['cs.CV'] +BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP,Jiawang Bai · Kuofeng Gao · Shaobo Min · Shu-Tao Xia · Zhifeng Li · Wei Liu, ,https://arxiv.org/abs/2311.16194,,2311.16194.pdf,BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP,"Contrastive Vision-Language Pre-training, known as CLIP, has shown promising +effectiveness in addressing downstream image recognition tasks. However, recent +works revealed that the CLIP model can be implanted with a downstream-oriented +backdoor. On downstream tasks, one victim model performs well on clean samples +but predicts a specific target class whenever a specific trigger is present. +For injecting a backdoor, existing attacks depend on a large amount of +additional data to maliciously fine-tune the entire pre-trained CLIP model, +which makes them inapplicable to data-limited scenarios. In this work, +motivated by the recent success of learnable prompts, we address this problem +by injecting a backdoor into the CLIP model in the prompt learning stage. Our +method named BadCLIP is built on a novel and effective mechanism in backdoor +attacks on CLIP, i.e., influencing both the image and text encoders with the +trigger. It consists of a learnable trigger applied to images and a +trigger-aware context generator, such that the trigger can change text features +via trigger-aware prompts, resulting in a powerful and generalizable attack. +Extensive experiments conducted on 11 datasets verify that the clean accuracy +of BadCLIP is similar to those of advanced prompt learning methods and the +attack success rate is higher than 99% in most cases. 
BadCLIP is also +generalizable to unseen classes, and shows a strong generalization capability +under cross-dataset and cross-domain settings.",cs.CV,['cs.CV'] +PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,Zheng Li · Xiang Li · xinyi fu · Xin Zhang · Weiqiang Wang · Shuo Chen · Jian Yang,https://zhengli97.github.io/PromptKD/,https://arxiv.org/abs/2403.02781v3,,2403.02781v3.pdf,PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,"Prompt learning has emerged as a valuable technique in enhancing +vision-language models (VLMs) such as CLIP for downstream tasks in specific +domains. Existing work mainly focuses on designing various learning forms of +prompts, neglecting the potential of prompts as effective distillers for +learning from larger teacher models. In this paper, we introduce an +unsupervised domain prompt distillation framework, which aims to transfer the +knowledge of a larger teacher model to a lightweight target model through +prompt-driven imitation using unlabeled domain images. Specifically, our +framework consists of two distinct stages. In the initial stage, we pre-train a +large CLIP teacher model using domain (few-shot) labels. After pre-training, we +leverage the unique decoupled-modality characteristics of CLIP by pre-computing +and storing the text features as class vectors only once through the teacher +text encoder. In the subsequent stage, the stored class vectors are shared +across teacher and student image encoders for calculating the predicted logits. +Further, we align the logits of both the teacher and student models via KL +divergence, encouraging the student image encoder to generate similar +probability distributions to the teacher through the learnable prompts. The +proposed prompt distillation process eliminates the reliance on labeled data, +enabling the algorithm to leverage a vast amount of unlabeled images within the +domain. Finally, the well-trained student image encoders and pre-stored text +features (class vectors) are utilized for inference. To our best knowledge, we +are the first to (1) perform unsupervised domain-specific prompt-driven +knowledge distillation for CLIP, and (2) establish a practical pre-storing +mechanism of text features as shared class vectors between teacher and student. +Extensive experiments on 11 datasets demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM,Minghao Yin · Shangzhe Wu · Kai Han, ,https://arxiv.org/html/2405.03413v2,,2405.03413v2.pdf,SL-SLAM: A robust visual-inertial SLAM based deep feature extraction and matching,"This paper explores how deep learning techniques can improve visual-based +SLAM performance in challenging environments. By combining deep feature +extraction and deep matching methods, we introduce a versatile hybrid visual +SLAM system designed to enhance adaptability in challenging scenarios, such as +low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. +Our system supports multiple modes, including monocular, stereo, +monocular-inertial, and stereo-inertial configurations. We also perform +analysis how to combine visual SLAM with deep learning methods to enlighten +other researches. Through extensive experiments on both public datasets and +self-sampled data, we demonstrate the superiority of the SL-SLAM system over +traditional approaches. 
The experimental results show that SL-SLAM outperforms +state-of-the-art SLAM algorithms in terms of localization accuracy and tracking +robustness. For the benefit of community, we make public the source code at +https://github.com/zzzzxxxx111/SLslam.",cs.RO,['cs.RO'] +GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering,Abdullah J Hamdi · Luke Melas-Kyriazi · Jinjie Mai · Guocheng Qian · Ruoshi Liu · Carl Vondrick · Bernard Ghanem · Andrea Vedaldi, ,https://arxiv.org/abs/2402.10128,,2402.10128.pdf,GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering,"Advancements in 3D Gaussian Splatting have significantly accelerated 3D +reconstruction and generation. However, it may require a large number of +Gaussians, which creates a substantial memory footprint. This paper introduces +GES (Generalized Exponential Splatting), a novel representation that employs +Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer +particles to represent a scene and thus significantly outperforming Gaussian +Splatting methods in efficiency with a plug-and-play replacement ability for +Gaussian-based utilities. GES is validated theoretically and empirically in +both principled 1D setup and realistic 3D scenes. + It is shown to represent signals with sharp edges more accurately, which are +typically challenging for Gaussians due to their inherent low-pass +characteristics. Our empirical analysis demonstrates that GEF outperforms +Gaussians in fitting natural-occurring signals (e.g. squares, triangles, and +parabolic signals), thereby reducing the need for extensive splitting +operations that increase the memory footprint of Gaussian Splatting. With the +aid of a frequency-modulated loss, GES achieves competitive performance in +novel-view synthesis benchmarks while requiring less than half the memory +storage of Gaussian Splatting and increasing the rendering speed by up to 39%. +The code is available on the project website https://abdullahamdi.com/ges .",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching,Lennart Bastian · Yizheng Xie · Nassir Navab · Zorah Lähner, ,https://arxiv.org/abs/2312.03678,,2312.03678.pdf,Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching,"Non-isometric shape correspondence remains a fundamental challenge in +computer vision. Traditional methods using Laplace-Beltrami operator (LBO) +eigenmodes face limitations in characterizing high-frequency extrinsic shape +changes like bending and creases. We propose a novel approach of combining the +non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell +hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in +which we construct functional maps. To this end, we present a theoretical +framework to effectively integrate non-orthogonal basis functions into +descriptor- and learning-based functional map methods. Our approach can be +incorporated easily into existing functional map pipelines across varying +applications and is able to handle complex deformations beyond isometries. We +show extensive evaluations across various supervised and unsupervised settings +and demonstrate significant improvements. 
Notably, our approach achieves up to +15% better mean geodesic error for non-isometric correspondence settings and up +to 45% improvement in scenarios with topological noise.",cs.CV,['cs.CV'] +DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction,Junwen Xiong · Peng Zhang · Tao You · Chuanyue Li · Wei Huang · Yufei Zha,https://github.com/junwenxiong/diff_sal,https://arxiv.org/abs/2403.01226,,2403.01226.pdf,DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction,"Audio-visual saliency prediction can draw support from diverse modality +complements, but further performance enhancement is still challenged by +customized architectures as well as task-specific loss functions. In recent +studies, denoising diffusion models have shown more promising in unifying task +frameworks owing to their inherent ability of generalization. Following this +motivation, a novel Diffusion architecture for generalized audio-visual +Saliency prediction (DiffSal) is proposed in this work, which formulates the +prediction problem as a conditional generative task of the saliency map by +utilizing input audio and video as the conditions. Based on the spatio-temporal +audio-visual features, an extra network Saliency-UNet is designed to perform +multi-modal attention modulation for progressive refinement of the ground-truth +saliency map from the noisy map. Extensive experiments demonstrate that the +proposed DiffSal can achieve excellent performance across six challenging +audio-visual benchmarks, with an average relative improvement of 6.3\% over the +previous state-of-the-art results by six metrics.",cs.CV,['cs.CV'] +Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models,Zijin Yang · Kai Zeng · Kejiang Chen · Han Fang · Weiming Zhang · Nenghai Yu, ,https://arxiv.org/abs/2404.04956,,2404.04956.pdf,Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models,"Ethical concerns surrounding copyright protection and inappropriate content +generation pose challenges for the practical implementation of diffusion +models. One effective solution involves watermarking the generated images. +However, existing methods often compromise the model performance or require +additional training, which is undesirable for operators and users. To address +this issue, we propose Gaussian Shading, a diffusion model watermarking +technique that is both performance-lossless and training-free, while serving +the dual purpose of copyright protection and tracing of offending content. Our +watermark embedding is free of model parameter modifications and thus is +plug-and-play. We map the watermark to latent representations following a +standard Gaussian distribution, which is indistinguishable from latent +representations obtained from the non-watermarked diffusion model. Therefore we +can achieve watermark embedding with lossless performance, for which we also +provide theoretical proof. Furthermore, since the watermark is intricately +linked with image semantics, it exhibits resilience to lossy processing and +erasure attempts. The watermark can be extracted by Denoising Diffusion +Implicit Models (DDIM) inversion and inverse sampling. 
We evaluate Gaussian +Shading on multiple versions of Stable Diffusion, and the results demonstrate +that Gaussian Shading not only is performance-lossless but also outperforms +existing methods in terms of robustness.",cs.CV,"['cs.CV', 'cs.CR']" +Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos,Chen Liu · Peike Li · Qingtao Yu · Hongwei Sheng · Dadong Wang · Lincheng Li · Xin Yu, ,https://arxiv.org/abs/2307.16620,,2307.16620.pdf,Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics,"The audio-visual segmentation (AVS) task aims to segment sounding objects +from a given video. Existing works mainly focus on fusing audio and visual +features of a given video to achieve sounding object masks. However, we +observed that prior arts are prone to segment a certain salient object in a +video regardless of the audio information. This is because sounding objects are +often the most salient ones in the AVS dataset. Thus, current AVS methods might +fail to localize genuine sounding objects due to the dataset bias. In this +work, we present an audio-visual instance-aware segmentation approach to +overcome the dataset bias. In a nutshell, our method first localizes potential +sounding objects in a video by an object segmentation network, and then +associates the sounding object candidates with the given audio. We notice that +an object could be a sounding object in one video but a silent one in another +video. This would bring ambiguity in training our object segmentation network +as only sounding objects have corresponding segmentation masks. We thus propose +a silent object-aware segmentation objective to alleviate the ambiguity. +Moreover, since the category information of audio is unknown, especially for +multiple sounding sources, we propose to explore the audio-visual semantic +correlation and then associate audio with potential objects. Specifically, we +attend predicted audio category scores to potential instance masks and these +scores will highlight corresponding sounding instances while suppressing +inaudible ones. When we enforce the attended instance masks to resemble the +ground-truth mask, we are able to establish audio-visual semantics correlation. +Experimental results on the AVS benchmarks demonstrate that our method can +effectively segment sounding objects without being biased to salient objects.",cs.SD,"['cs.SD', 'cs.CV', 'eess.AS']" +Modular Blind Video Quality Assessment,Wen Wen · Mu Li · Yabin ZHANG · Yiting Liao · Junlin Li · Li zhang · Kede Ma, ,https://arxiv.org/abs/2402.19276,,2402.19276.pdf,Modular Blind Video Quality Assessment,"Blind video quality assessment (BVQA) plays a pivotal role in evaluating and +improving the viewing experience of end-users across a wide range of +video-based platforms and services. Contemporary deep learning-based models +primarily analyze video content in its aggressively subsampled format, while +being blind to the impact of the actual spatial resolution and frame rate on +video quality. In this paper, we propose a modular BVQA model and a method of +training it to improve its modularity. Our model comprises a base quality +predictor, a spatial rectifier, and a temporal rectifier, responding to the +visual content and distortion, spatial resolution, and frame rate changes on +video quality, respectively. During training, spatial and temporal rectifiers +are dropped out with some probabilities to render the base quality predictor a +standalone BVQA model, which should work better with the rectifiers. 
Extensive +experiments on both professionally-generated content and user-generated content +video databases show that our quality model achieves superior or comparable +performance to current methods. Additionally, the modularity of our model +offers an opportunity to analyze existing video quality databases in terms of +their spatial and temporal complexity.",eess.IV,"['eess.IV', 'cs.CV']" +Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation,Ji-Jia Wu · Andy Chia-Hao Chang · Chieh-Yu Chuang · Chun-Pei Chen · Yu-Lun Liu · Min-Hung Chen · Hou-Ning Hu · Yung-Yu Chuang · Yen-Yu Lin, ,https://arxiv.org/abs/2404.04231,,2404.04231.pdf,Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation,"This paper addresses text-supervised semantic segmentation, aiming to learn a +model capable of segmenting arbitrary visual concepts within images by using +only image-text pairs without dense annotations. Existing methods have +demonstrated that contrastive learning on image-text pairs effectively aligns +visual segments with the meanings of texts. We notice that there is a +discrepancy between text alignment and semantic segmentation: A text often +consists of multiple semantic concepts, whereas semantic segmentation strives +to create semantically homogeneous segments. To address this issue, we propose +a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image +and text are jointly decomposed into a set of image regions and a set of word +segments, respectively, and contrastive learning is developed to enforce +region-word alignment. To work with a vision-language model, we present a +prompt learning mechanism that derives an extra representation to highlight an +image segment or a word segment of interest, with which more effective features +can be extracted from that segment. Comprehensive experimental results +demonstrate that our method performs favorably against existing text-supervised +semantic segmentation methods on six benchmark datasets.",cs.CV,['cs.CV'] +Detector-Free Structure from Motion,Xingyi He · Jiaming Sun · Yifan Wang · Sida Peng · Qixing Huang · Hujun Bao · Xiaowei Zhou, ,https://arxiv.org/abs/2306.15669,,2306.15669.pdf,Detector-Free Structure from Motion,"We propose a new structure-from-motion framework to recover accurate camera +poses and point clouds from unordered images. Traditional SfM systems typically +rely on the successful detection of repeatable keypoints across multiple views +as the first step, which is difficult for texture-poor scenes, and poor +keypoint detection may break down the whole SfM system. We propose a new +detector-free SfM framework to draw benefits from the recent success of +detector-free matchers to avoid the early determination of keypoints, while +solving the multi-view inconsistency issue of detector-free matchers. +Specifically, our framework first reconstructs a coarse SfM model from +quantized detector-free matches. Then, it refines the model by a novel +iterative refinement pipeline, which iterates between an attention-based +multi-view matching module to refine feature tracks and a geometry refinement +module to improve the reconstruction accuracy. Experiments demonstrate that the +proposed framework outperforms existing detector-based SfM systems on common +benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate +the capability of our framework to reconstruct texture-poor scenes. 
Based on +this framework, we take $\textit{first place}$ in Image Matching Challenge +2023.",cs.CV,['cs.CV'] +Simple Semantic-Aided Few-Shot Learning,Hai Zhang · Junzhe Xu · Shanlin Jiang · Zhenan He,https://github.com/zhangdoudou123/SemFew,https://arxiv.org/abs/2311.18649,,2311.18649.pdf,Simple Semantic-Aided Few-Shot Learning,"Learning from a limited amount of data, namely Few-Shot Learning, stands out +as a challenging computer vision task. Several works exploit semantics and +design complicated semantic fusion mechanisms to compensate for rare +representative features within restricted data. However, relying on naive +semantics such as class names introduces biases due to their brevity, while +acquiring extensive semantics from external knowledge takes a huge time and +effort. This limitation severely constrains the potential of semantics in +Few-Shot Learning. In this paper, we design an automatic way called Semantic +Evolution to generate high-quality semantics. The incorporation of high-quality +semantics alleviates the need for complex network structures and learning +algorithms used in previous works. Hence, we employ a simple two-layer network +termed Semantic Alignment Network to transform semantics and visual features +into robust class prototypes with rich discriminative features for few-shot +classification. The experimental results show our framework outperforms all +previous methods on six benchmarks, demonstrating a simple network with +high-quality semantics can beat intricate multi-modal modules on few-shot +classification tasks. Code is available at +https://github.com/zhangdoudou123/SemFew.",cs.CV,['cs.CV'] +iToF-flow-based High Frame Rate Depth Imaging,Yu Meng · Zhou Xue · Xu Chang · Xuemei Hu · Tao Yue, ,https://arxiv.org/abs/2306.17618,,2306.17618.pdf,Polarimetric iToF: Measuring High-Fidelity Depth through Scattering Media,"Indirect time-of-flight (iToF) imaging allows us to capture dense depth +information at a low cost. However, iToF imaging often suffers from multipath +interference (MPI) artifacts in the presence of scattering media, resulting in +severe depth-accuracy degradation. For instance, iToF cameras cannot measure +depth accurately through fog because ToF active illumination scatters back to +the sensor before reaching the farther target surface. In this work, we propose +a polarimetric iToF imaging method that can capture depth information robustly +through scattering media. Our observations on the principle of indirect ToF +imaging and polarization of light allow us to formulate a novel computational +model of scattering-aware polarimetric phase measurements that enables us to +correct MPI errors. We first devise a scattering-aware polarimetric iToF model +that can estimate the phase of unpolarized backscattered light. We then combine +the optical filtering of polarization and our computational modeling of +unpolarized backscattered light via scattering analysis of phase and amplitude. +This allows us to tackle the MPI problem by estimating the scattering energy +through the participating media. We validate our method on an experimental +setup using a customized off-the-shelf iToF camera. 
Our method outperforms +baseline methods by a significant margin by means of our scattering model and +polarimetric phase measurements.",cs.CV,['cs.CV'] +Perceptual-Oriented Video Frame Interpolation Via Asymmetric Synergistic Blending,Guangyang Wu · Xin Tao · Changlin Li · Wenyi Wang · Xiaohong Liu · Qingqing Zheng, ,https://arxiv.org/abs/2404.06692,,2404.06692.pdf,Perception-Oriented Video Frame Interpolation via Asymmetric Blending,"Previous methods for Video Frame Interpolation (VFI) have encountered +challenges, notably the manifestation of blur and ghosting effects. These +issues can be traced back to two pivotal factors: unavoidable motion errors and +misalignment in supervision. In practice, motion estimates often prove to be +error-prone, resulting in misaligned features. Furthermore, the reconstruction +loss tends to bring blurry results, particularly in misaligned regions. To +mitigate these challenges, we propose a new paradigm called PerVFI +(Perception-oriented Video Frame Interpolation). Our approach incorporates an +Asymmetric Synergistic Blending module (ASB) that utilizes features from both +sides to synergistically blend intermediate features. One reference frame +emphasizes primary content, while the other contributes complementary +information. To impose a stringent constraint on the blending process, we +introduce a self-learned sparse quasi-binary mask which effectively mitigates +ghosting and blur artifacts in the output. Additionally, we employ a +normalizing flow-based generator and utilize the negative log-likelihood loss +to learn the conditional distribution of the output, which further facilitates +the generation of clear and fine details. Experimental results validate the +superiority of PerVFI, demonstrating significant improvements in perceptual +quality compared to existing methods. Codes are available at +\url{https://github.com/mulns/PerVFI}",cs.CV,['cs.CV'] +SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,Changan Chen · Kumar Ashutosh · Rohit Girdhar · David Harwath · Kristen Grauman,https://vision.cs.utexas.edu/projects/soundingactions/,https://arxiv.org/abs/2404.05206,,,SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,"We propose a novel self-supervised embedding to learn how actions sound from +narrated in-the-wild egocentric videos. Whereas existing methods rely on +curated data with known audio-visual correspondence, our multimodal +contrastive-consensus coding (MC3) embedding reinforces the associations +between audio, language, and vision when all modality pairs agree, while +diminishing those associations when any one pair does not. We show our approach +can successfully discover how the long tail of human actions sound from +egocentric video, outperforming an array of recent multimodal embedding +techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal +tasks.",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Dynamic LiDAR Re-simulation using Compositional Neural Fields,Hanfeng Wu · Xingxing Zuo · Stefan Leutenegger · Or Litany · Konrad Schindler · Shengyu Huang, ,https://arxiv.org/abs/2312.05247,,2312.05247.pdf,Dynamic LiDAR Re-simulation using Compositional Neural Fields,"We introduce DyNFL, a novel neural field-based approach for high-fidelity +re-simulation of LiDAR scans in dynamic driving scenes. DyNFL processes LiDAR +measurements from dynamic environments, accompanied by bounding boxes of moving +objects, to construct an editable neural field. 
This field, comprising +separately reconstructed static background and dynamic objects, allows users to +modify viewpoints, adjust object positions, and seamlessly add or remove +objects in the re-simulated scene. A key innovation of our method is the neural +field composition technique, which effectively integrates reconstructed neural +assets from various scenes through a ray drop test, accounting for occlusions +and transparent surfaces. Our evaluation with both synthetic and real-world +environments demonstrates that DyNFL substantially improves dynamic scene LiDAR +simulation, offering a combination of physical fidelity and flexible editing +capabilities.",cs.CV,['cs.CV'] +GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding,Zi-Ting Chou · Sheng-Yu Huang · I-Jieh Liu · Yu-Chiang Frank Wang,https://timchou-ntu.github.io/gsnerf/,https://arxiv.org/abs/2403.03608,,2403.03608.pdf,GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding,"Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance +Fields (NeRF) have emerged as a popular research topic in 3D vision. In this +work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), +which uniquely takes image semantics into the synthesis process so that both +novel view images and the associated semantic maps can be produced for unseen +scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and +Depth-Guided Visual rendering. The former is able to observe multi-view image +inputs to extract semantic and geometry features from a scene. Guided by the +resulting image geometry information, the latter performs both image and +semantic rendering with improved performances. Our experiments not only confirm +that GSNeRF performs favorably against prior works on both novel-view image and +semantic segmentation synthesis but the effectiveness of our sampling strategy +for visual rendering is further verified.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,Kunchang Li · Yali Wang · Yinan He · Yizhuo Li · Yi Wang · Yi Liu · Zun Wang · Jilan Xu · Guo Chen · Ping Luo · Limin Wang · Yu Qiao,https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2,https://arxiv.org/abs/2311.17005,,2311.17005.pdf,MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,"With the rapid development of Multi-modal Large Language Models (MLLMs), a +number of diagnostic benchmarks have recently emerged to evaluate the +comprehension capabilities of these models. However, most benchmarks +predominantly assess spatial understanding in the static image tasks, while +overlooking temporal understanding in the dynamic video tasks. To alleviate +this issue, we introduce a comprehensive Multi-modal Video understanding +Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot +be effectively solved with a single frame. Specifically, we first introduce a +novel static-to-dynamic method to define these temporal-related tasks. By +transforming various static tasks into dynamic ones, we enable the systematic +generation of video tasks that require a broad spectrum of temporal skills, +ranging from perception to cognition. Then, guided by the task definition, we +automatically convert public video annotations into multiple-choice QA to +evaluate each task. On one hand, such a distinct paradigm allows us to build +MVBench efficiently, without much manual intervention. 
On the other hand, it +guarantees evaluation fairness with ground-truth video annotations, avoiding +the biased scoring of LLMs. Moreover, we further develop a robust video MLLM +baseline, i.e., VideoChat2, by progressive multi-modal training with diverse +instruction-tuning data. The extensive results on our MVBench reveal that, the +existing MLLMs are far from satisfactory in temporal understanding, while our +VideoChat2 largely surpasses these leading models by over 15% on MVBench. All +models and data are available at https://github.com/OpenGVLab/Ask-Anything.",cs.CV,['cs.CV'] +Unsupervised Gaze Representation Learning from Multi-view Face Images,Yiwei Bao · Feng Lu, ,https://arxiv.org/abs/2309.04506,,2309.04506.pdf,Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition,"Appearance-based gaze estimation has shown great promise in many applications +by using a single general-purpose camera as the input device. However, its +success is highly depending on the availability of large-scale well-annotated +gaze datasets, which are sparse and expensive to collect. To alleviate this +challenge we propose ConGaze, a contrastive learning-based framework that +leverages unlabeled facial images to learn generic gaze-aware representations +across subjects in an unsupervised way. Specifically, we introduce the +gaze-specific data augmentation to preserve the gaze-semantic features and +maintain the gaze consistency, which are proven to be crucial for effective +contrastive gaze representation learning. Moreover, we devise a novel +subject-conditional projection module that encourages a share feature extractor +to learn gaze-aware and generic representations. Our experiments on three +public gaze estimation datasets show that ConGaze outperforms existing +unsupervised learning solutions by 6.7% to 22.5%; and achieves 15.1% to 24.6% +improvement over its supervised learning-based counterpart in cross-dataset +evaluations.",cs.CV,['cs.CV'] +DIOD: Self-Distillation Meets Object Discovery,Sandra Kara · Hejer AMMAR · Julien Denize · Florian Chabot · Quoc Cuong PHAM, ,https://arxiv.org/abs/2311.02633,,2311.02633.pdf,The Background Also Matters: Background-Aware Motion-Guided Objects Discovery,"Recent works have shown that objects discovery can largely benefit from the +inherent motion information in video data. However, these methods lack a proper +background processing, resulting in an over-segmentation of the non-object +regions into random segments. This is a critical limitation given the +unsupervised setting, where object segments and noise are not distinguishable. +To address this limitation we propose BMOD, a Background-aware Motion-guided +Objects Discovery method. Concretely, we leverage masks of moving objects +extracted from optical flow and design a learning mechanism to extend them to +the true foreground composed of both moving and static objects. The background, +a complementary concept of the learned foreground class, is then isolated in +the object discovery process. This enables a joint learning of the objects +discovery task and the object/non-object separation. The conducted experiments +on synthetic and real-world datasets show that integrating our background +handling with various cutting-edge methods brings each time a considerable +improvement. 
Specifically, we improve the objects discovery performance with a +large margin, while establishing a strong baseline for object/non-object +separation.",cs.CV,['cs.CV'] +$\textbf{LaRE}^2$: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection,Yunpeng Luo · Junlong Du · Ke Yan · Shouhong Ding, ,https://arxiv.org/abs/2403.17465,,2403.17465.pdf,LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection,"The evolution of Diffusion Models has dramatically improved image generation +quality, making it increasingly difficult to differentiate between real and +generated images. This development, while impressive, also raises significant +privacy and security concerns. In response to this, we propose a novel Latent +REconstruction error guided feature REfinement method (LaRE^2) for detecting +the diffusion-generated images. We come up with the Latent Reconstruction Error +(LaRE), the first reconstruction-error based feature in the latent space for +generated image detection. LaRE surpasses existing methods in terms of feature +extraction efficiency while preserving crucial cues required to differentiate +between the real and the fake. To exploit LaRE, we propose an Error-Guided +feature REfinement module (EGRE), which can refine the image feature guided by +LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an +align-then-refine mechanism, which effectively refines the image feature for +generated-image detection from both spatial and channel perspectives. Extensive +experiments on the large-scale GenImage benchmark demonstrate the superiority +of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1% +average ACC/AP across 8 different image generators. LaRE also surpasses +existing methods in terms of feature extraction cost, delivering an impressive +speed enhancement of 8 times.",cs.CV,"['cs.CV', 'cs.AI']" +MindBridge: A Cross-Subject Brain Decoding Framework,Shizun Wang · Songhua Liu · Zhenxiong Tan · Xinchao Wang,https://littlepure2333.github.io/MindBridge/,https://arxiv.org/abs/2404.07850,,2404.07850.pdf,MindBridge: A Cross-Subject Brain Decoding Framework,"Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli +from acquired brain signals, primarily utilizing functional magnetic resonance +imaging (fMRI). Currently, brain decoding is confined to a +per-subject-per-model paradigm, limiting its applicability to the same +individual for whom the decoding model is trained. This constraint stems from +three key challenges: 1) the inherent variability in input dimensions across +subjects due to differences in brain size; 2) the unique intrinsic neural +patterns, influencing how different individuals perceive and process sensory +information; 3) limited data availability for new subjects in real-world +scenarios hampers the performance of decoding models. In this paper, we present +a novel approach, MindBridge, that achieves cross-subject brain decoding by +employing only one model. Our proposed framework establishes a generic paradigm +capable of addressing these challenges by introducing biological-inspired +aggregation function and novel cyclic fMRI reconstruction mechanism for +subject-invariant representation learning. Notably, by cycle reconstruction of +fMRI, MindBridge can enable novel fMRI synthesis, which also can serve as +pseudo data augmentation. Within the framework, we also devise a novel +reset-tuning method for adapting a pretrained model to a new subject. 
+Experimental results demonstrate MindBridge's ability to reconstruct images for +multiple subjects, which is competitive with dedicated subject-specific models. +Furthermore, with limited data for a new subject, we achieve a high level of +decoding accuracy, surpassing that of subject-specific models. This advancement +in cross-subject brain decoding suggests promising directions for wider +applications in neuroscience and indicates potential for more efficient +utilization of limited fMRI data in real-world scenarios. Project page: +https://littlepure2333.github.io/MindBridge",cs.CV,"['cs.CV', 'cs.AI']" +Capturing Closely Interacted Two-Person Motions with Reaction Priors,Qi Fang · Yinghui Fan · Yanjun Li · Junting Dong · Dingwei Wu · Weidong Zhang · Kang Chen, ,https://arxiv.org/abs/2404.05490,,2404.05490.pdf,Two-Person Interaction Augmentation with Skeleton Priors,"Close and continuous interaction with rich contacts is a crucial aspect of +human activities (e.g. hugging, dancing) and of interest in many domains like +activity recognition, motion prediction, character animation, etc. However, +acquiring such skeletal motion is challenging. While direct motion capture is +expensive and slow, motion editing/generation is also non-trivial, as complex +contact patterns with topological and geometric constraints have to be +retained. To this end, we propose a new deep learning method for two-body +skeletal interaction motion augmentation, which can generate variations of +contact-rich interactions with varying body sizes and proportions while +retaining the key geometric/topological relations between two bodies. Our +system can learn effectively from a relatively small amount of data and +generalize to drastically different skeleton sizes. Through exhaustive +evaluation and comparison, we show it can generate high-quality motions, has +strong generalizability and outperforms traditional optimization-based methods +and alternative deep learning solutions.",cs.CV,['cs.CV'] +Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis,FeiFan Xu · Rui Li · Si Wu · Yong Xu · Hau San Wong, ,,https://huggingface.co/papers/2306.17115,,,,,nan +Purified and Unified Steganographic Network,GuoBiao Li · Sheng Li · Zicong Luo · Zhenxing Qian · Xinpeng Zhang,https://github.com/albblgb/PUSNet,https://arxiv.org/abs/2402.17210,,2402.17210.pdf,Purified and Unified Steganographic Network,"Steganography is the art of hiding secret data into the cover media for +covert communication. In recent years, more and more deep neural network +(DNN)-based steganographic schemes are proposed to train steganographic +networks for secret embedding and recovery, which are shown to be promising. +Compared with the handcrafted steganographic tools, steganographic networks +tend to be large in size. It raises concerns on how to imperceptibly and +effectively transmit these networks to the sender and receiver to facilitate +the covert communication. To address this issue, we propose in this paper a +Purified and Unified Steganographic Network (PUSNet). It performs an ordinary +machine learning task in a purified network, which could be triggered into +steganographic networks for secret embedding or recovery using different keys. +We formulate the construction of the PUSNet into a sparse weight filling +problem to flexibly switch between the purified and steganographic networks. 
We +further instantiate our PUSNet as an image denoising network with two +steganographic networks concealed for secret image embedding and recovery. +Comprehensive experiments demonstrate that our PUSNet achieves good performance +on secret image embedding, secret image recovery, and image denoising in a +single architecture. It is also shown to be capable of imperceptibly carrying +the steganographic networks in a purified network. Code is available at +\url{https://github.com/albblgb/PUSNet}",cs.CR,"['cs.CR', 'cs.CV']" +Synergistic Global-space Camera and Human Reconstruction from Videos,Yizhou Zhao · Tuanfeng Y. Wang · Bhiksha Raj · Min Xu · Jimei Yang · Chun-Hao P. Huang,https://paulchhuang.github.io/synchmr/,https://arxiv.org/abs/2405.14855,,2405.14855.pdf,Synergistic Global-space Camera and Human Reconstruction from Videos,"Remarkable strides have been made in reconstructing static scenes or human +bodies from monocular videos. Yet, the two problems have largely been +approached independently, without much synergy. Most visual SLAM methods can +only reconstruct camera trajectories and scene structures up to scale, while +most HMR methods reconstruct human meshes in metric scale but fall short in +reasoning with cameras and scenes. This work introduces Synergistic Camera and +Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, +we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and +scene point clouds using camera-frame HMR as a strong prior, addressing depth, +scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we +further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by +incorporating spatio-temporal coherency and dynamic scene constraints. +Together, they lead to consistent reconstructions of camera trajectories, human +meshes, and dense scene point clouds in a common world frame. Project page: +https://paulchhuang.github.io/synchmr",cs.CV,"['cs.CV', 'cs.AI']" +VRetouchEr: Learning Cross-frame Feature Interdependence with Imperfection Flow for Face Retouching in Videos,Wen Xue · Le Jiang · Lianxin Xie · Si Wu · Yong Xu · Hau San Wong, ,,https://ojs.aaai.org/index.php/AAAI/article/view/28404,,,,,nan +Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning,Xialei Liu · Jiang-Tian Zhai · Andrew Bagdanov · Ke Li · Ming-Ming Cheng, ,,https://www.youtube.com/watch?v=5VfpqIwrbWM,,,,,nan +HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction,Yi ZHOU · Hui Zhang · Jiaqian Yu · yifan yang · Sangil Jung · Seung-In Park · ByungIn Yoo, ,https://arxiv.org/abs/2403.08639,,2403.08639.pdf,HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction,"Vectorized High-Definition (HD) map construction requires predictions of the +category and point coordinates of map elements (e.g. road boundary, lane +divider, pedestrian crossing, etc.). State-of-the-art methods are mainly based +on point-level representation learning for regressing accurate point +coordinates. However, this pipeline has limitations in obtaining element-level +information and handling element-level failures, e.g. erroneous element shape +or entanglement between elements. To tackle the above issues, we propose a +simple yet effective HybrId framework named HIMap to sufficiently learn and +interact both point-level and element-level information. 
Concretely, we +introduce a hybrid representation called HIQuery to represent all map elements, +and propose a point-element interactor to interactively extract and encode the +hybrid information of elements, e.g. point position and element shape, into the +HIQuery. Additionally, we present a point-element consistency constraint to +enhance the consistency between the point-level and element-level information. +Finally, the output point-element integrated HIQuery can be directly converted +into map elements' class, point coordinates, and mask. We conduct extensive +experiments and consistently outperform previous methods on both nuScenes and +Argoverse2 datasets. Notably, our method achieves $77.8$ mAP on the nuScenes +dataset, remarkably superior to previous SOTAs by $8.3$ mAP at least.",cs.CV,['cs.CV'] +Making Vision Transformers Truly Shift-Equivariant,Renan A. Rojas-Gomez · Teck-Yian Lim · Minh Do · Raymond A. Yeh,https://renanrojasg.github.io/shifteq_vit/,,https://www.youtube.com/watch?v=PBNdb93NqiA,,,,,nan +Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration,Hong Chen · Pei Yan · sihe xiang · Yihua Tan, ,https://arxiv.org/abs/2404.14034,,2404.14034.pdf,PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer,"Point cloud registration is a fundamental technique in 3-D computer vision +with applications in graphics, autonomous driving, and robotics. However, +registration tasks under challenging conditions, under which noise or +perturbations are prevalent, can be difficult. We propose a robust point cloud +registration approach that leverages graph neural partial differential +equations (PDEs) and heat kernel signatures. Our method first uses graph neural +PDE modules to extract high dimensional features from point clouds by +aggregating information from the 3-D point neighborhood, thereby enhancing the +robustness of the feature representations. Then, we incorporate heat kernel +signatures into an attention mechanism to efficiently obtain corresponding +keypoints. Finally, a singular value decomposition (SVD) module with learnable +weights is used to predict the transformation between two point clouds. +Empirical experiments on a 3-D point cloud dataset demonstrate that our +approach not only achieves state-of-the-art performance for point cloud +registration but also exhibits better robustness to additive noise or 3-D shape +perturbations.",cs.CV,['cs.CV'] +Generative Multi-modal Models are Good Class Incremental Learners,Xusheng Cao · Haori Lu · Linlan Huang · Xialei Liu · Ming-Ming Cheng, ,https://arxiv.org/abs/2403.18383,,2403.18383.pdf,Generative Multi-modal Models are Good Class-Incremental Learners,"In class-incremental learning (CIL) scenarios, the phenomenon of catastrophic +forgetting caused by the classifier's bias towards the current task has long +posed a significant challenge. It is mainly caused by the characteristic of +discriminative models. With the growing popularity of the generative +multi-modal models, we would explore replacing discriminative models with +generative ones for CIL. However, transitioning from discriminative to +generative models requires addressing two key challenges. The primary challenge +lies in transferring the generated textual information into the classification +of distinct categories. Additionally, it requires formulating the task of CIL +within a generative framework. To this end, we propose a novel generative +multi-modal model (GMM) framework for class-incremental learning. 
Our approach +directly generates labels for images using an adapted generative model. After +obtaining the detailed text, we use a text encoder to extract text features and +employ feature matching to determine the most similar label as the +classification prediction. In the conventional CIL settings, we achieve +significantly better results in long-sequence task scenarios. Under the +Few-shot CIL setting, we have improved by at least 14\% accuracy over all the +current state-of-the-art methods with significantly less forgetting. Our code +is available at \url{https://github.com/DoubleClass/GMM}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models,Rongjie Li · Songyang Zhang · Dahua Lin · Kai Chen · Xuming He, ,https://arxiv.org/abs/2404.00906,,2404.00906.pdf,From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models,"Scene graph generation (SGG) aims to parse a visual scene into an +intermediate graph representation for downstream reasoning tasks. Despite +recent advancements, existing methods struggle to generate scene graphs with +novel visual relation concepts. To address this challenge, we introduce a new +open-vocabulary SGG framework based on sequence generation. Our framework +leverages vision-language pre-trained models (VLM) by incorporating an +image-to-graph generation paradigm. Specifically, we generate scene graph +sequences via image-to-text generation with VLM and then construct scene graphs +from these sequences. By doing so, we harness the strong capabilities of VLM +for open-vocabulary SGG and seamlessly integrate explicit relational modeling +for enhancing the VL tasks. Experimental results demonstrate that our design +not only achieves superior performance with an open vocabulary but also +enhances downstream vision-language task performance through explicit relation +modeling knowledge.",cs.CV,['cs.CV'] +Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation,Shanshan Zhong · Zhongzhan Huang · Shanghua Gao · Wushao Wen · Liang Lin · Marinka Zitnik · Pan Zhou,https://zhongshsh.github.io/CLoT/,https://arxiv.org/abs/2312.02439,,2312.02439.pdf,Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation,"Chain-of-Thought (CoT) guides large language models (LLMs) to reason +step-by-step, and can motivate their logical reasoning ability. While effective +for logical tasks, CoT is not conducive to creative problem-solving which often +requires out-of-box thoughts and is crucial for innovation advancements. In +this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a +non-sequential, creative paradigm involving strong associations and knowledge +leaps. To this end, we study LLMs on the popular Oogiri game which needs +participants to have good creativity and strong associative thinking for +responding unexpectedly and humorously to the given image, text, or both, and +thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the +Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset +which contains over 130,000 samples from the Oogiri game, and observe the +insufficient LoT ability or failures of most existing LLMs on the Oogiri game. +Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve +LLM's LoT ability. 
CLoT first formulates the Oogiri-GO dataset into +LoT-oriented instruction tuning data to train pretrained LLM for achieving +certain LoT humor generation and discrimination abilities. Then CLoT designs an +explorative self-refinement that encourages the LLM to generate more creative +LoT data via exploring parallels between seemingly unrelated concepts and +selects high-quality data to train itself for self-refinement. CLoT not only +excels in humor generation in the Oogiri game but also boosts creative +abilities in various tasks like cloud guessing game and divergent association +task. These findings advance our understanding and offer a pathway to improve +LLMs' creative capacities for innovative applications across domains. The +dataset, code, and models will be released online. +https://zhongshsh.github.io/CLoT/.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV']" +Enhancing Visual Continual Learning with Language-Guided Supervision,Bolin Ni · Hongbo Zhao · Chenghao Zhang · Ke Hu · Gaofeng Meng · Zhaoxiang Zhang · Shiming Xiang, ,https://arxiv.org/abs/2403.16124,,2403.16124.pdf,Enhancing Visual Continual Learning with Language-Guided Supervision,"Continual learning (CL) aims to empower models to learn new tasks without +forgetting previously acquired knowledge. Most prior works concentrate on the +techniques of architectures, replay data, regularization, \etc. However, the +category name of each class is largely neglected. Existing methods commonly +utilize the one-hot labels and randomly initialize the classifier head. We +argue that the scarce semantic information conveyed by the one-hot labels +hampers the effective knowledge transfer across tasks. In this paper, we +revisit the role of the classifier head within the CL paradigm and replace the +classifier with semantic knowledge from pretrained language models (PLMs). +Specifically, we use PLMs to generate semantic targets for each class, which +are frozen and serve as supervision signals during training. Such targets fully +consider the semantic correlation between all classes across tasks. Empirical +studies show that our approach mitigates forgetting by alleviating +representation drifting and facilitating knowledge transfer across tasks. The +proposed method is simple to implement and can seamlessly be plugged into +existing methods with negligible adjustments. Extensive experiments based on +eleven mainstream baselines demonstrate the effectiveness and generalizability +of our approach to various protocols. For example, under the class-incremental +learning setting on ImageNet-100, our method significantly improves the Top-1 +accuracy by 3.2\% to 6.1\% while reducing the forgetting rate by 2.6\% to +13.1\%.",cs.CV,['cs.CV'] +Learned Trajectory Embedding for Subspace Clustering,Yaroslava Lochman · Christopher Zach · Carl Olsson,https://ylochman.github.io/trajectory-embedding,,https://link.springer.com/article/10.1007/s44267-024-00043-0,,,,,nan +Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network,Aihua Mao · Biao Yan · Zijing Ma · Ying He, ,https://arxiv.org/abs/2401.09721,,2401.09721.pdf,Fast graph-based denoising for point cloud color information,"Point clouds are utilized in various 3D applications such as cross-reality +(XR) and realistic 3D displays. In some applications, e.g., for live streaming +using a 3D point cloud, real-time point cloud denoising methods are required to +enhance the visual quality. 
However, conventional high-precision denoising +methods cannot be executed in real time for large-scale point clouds owing to +the complexity of graph constructions with K nearest neighbors and noise level +estimation. This paper proposes a fast graph-based denoising (FGBD) for a +large-scale point cloud. First, high-speed graph construction is achieved by +scanning a point cloud in various directions and searching adjacent +neighborhoods on the scanning lines. Second, we propose a fast noise level +estimation method using eigenvalues of the covariance matrix on a graph. +Finally, we also propose a new low-cost filter selection method to enhance +denoising accuracy to compensate for the degradation caused by the acceleration +algorithms. In our experiments, we succeeded in reducing the processing time +dramatically while maintaining accuracy relative to conventional denoising +methods. Denoising was performed at 30fps, with frames containing approximately +1 million points.",cs.CV,"['cs.CV', 'eess.IV', 'eess.SP']" +LASO: Language-guided Affordance Segmentation on 3D Object,Yicong Li · Na Zhao · Junbin Xiao · Chun Feng · Xiang Wang · Tat-seng Chua, ,https://arxiv.org/abs/2309.10911,,2309.10911.pdf,Language-Conditioned Affordance-Pose Detection in 3D Point Clouds,"Affordance detection and pose estimation are of great importance in many +robotic applications. Their combination helps the robot gain an enhanced +manipulation capability, in which the generated pose can facilitate the +corresponding affordance task. Previous methods for affodance-pose joint +learning are limited to a predefined set of affordances, thus limiting the +adaptability of robots in real-world environments. In this paper, we propose a +new method for language-conditioned affordance-pose joint learning in 3D point +clouds. Given a 3D point cloud object, our method detects the affordance region +and generates appropriate 6-DoF poses for any unconstrained affordance label. +Our method consists of an open-vocabulary affordance detection branch and a +language-guided diffusion model that generates 6-DoF poses based on the +affordance text. We also introduce a new high-quality dataset for the task of +language-driven affordance-pose joint learning. Intensive experimental results +demonstrate that our proposed method works effectively on a wide range of +open-vocabulary affordances and outperforms other baselines by a large margin. +In addition, we illustrate the usefulness of our method in real-world robotic +applications. Our code and dataset are publicly available at +https://3DAPNet.github.io",cs.RO,['cs.RO'] +MonoCD: Monocular 3D Object Detection with Complementary Depths,Longfei Yan · Pei Yan · Shengzhou Xiong · Xuanyu Xiang · Yihua Tan,https://github.com/elvintanhust/MonoCD,https://arxiv.org/abs/2404.03181v1,,2404.03181v1.pdf,MonoCD: Monocular 3D Object Detection with Complementary Depths,"Monocular 3D object detection has attracted widespread attention due to its +potential to accurately obtain object 3D localization from a single image at a +low cost. Depth estimation is an essential but challenging subtask of monocular +3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods +explore multiple local depth clues such as object heights and keypoints and +then formulate the object depth estimation as an ensemble of multiple depth +predictions to mitigate the insufficiency of single-depth information. 
However, +the errors of existing multiple depths tend to have the same sign, which +hinders them from neutralizing each other and limits the overall accuracy of +combined depth. To alleviate this problem, we propose to increase the +complementarity of depths with two novel designs. First, we add a new depth +prediction branch named complementary depth that utilizes global and efficient +depth clues from the entire image rather than the local clues to reduce the +correlation of depth predictions. Second, we propose to fully exploit the +geometric relations between multiple depth clues to achieve complementarity in +form. Benefiting from these designs, our method achieves higher +complementarity. Experiments on the KITTI benchmark demonstrate that our method +achieves state-of-the-art performance without introducing extra data. In +addition, complementary depth can also be a lightweight and plug-and-play +module to boost multiple existing monocular 3d object detectors. Code is +available at https://github.com/elvintanhust/MonoCD.",cs.CV,['cs.CV'] +All Rivers Run to the Sea: Private Learning with Asymmetric Flows,Yue Niu · Ramy E. Ali · Saurav Prakash · Salman Avestimehr, ,https://arxiv.org/abs/2312.05264,,2312.05264.pdf,All Rivers Run to the Sea: Private Learning with Asymmetric Flows,"Data privacy is of great concern in cloud machine-learning service platforms, +when sensitive data are exposed to service providers. While private computing +environments (e.g., secure enclaves), and cryptographic approaches (e.g., +homomorphic encryption) provide strong privacy protection, their computing +performance still falls short compared to cloud GPUs. To achieve privacy +protection with high computing performance, we propose Delta, a new private +training and inference framework, with comparable model performance as +non-private centralized training. Delta features two asymmetric data flows: the +main information-sensitive flow and the residual flow. The main part flows into +a small model while the residuals are offloaded to a large model. Specifically, +Delta embeds the information-sensitive representations into a low-dimensional +space while pushing the information-insensitive part into high-dimension +residuals. To ensure privacy protection, the low-dimensional +information-sensitive part is secured and fed to a small model in a private +environment. On the other hand, the residual part is sent to fast cloud GPUs, +and processed by a large model. To further enhance privacy and reduce the +communication cost, Delta applies a random binary quantization technique along +with a DP-based technique to the residuals before sharing them with the public +platform. We theoretically show that Delta guarantees differential privacy in +the public environment and greatly reduces the complexity in the private +environment. 
We conduct empirical analyses on CIFAR-10, CIFAR-100 and ImageNet +datasets and ResNet-18 and ResNet-34, showing that Delta achieves strong +privacy protection, fast training, and inference without significantly +compromising the model utility.",cs.CR,"['cs.CR', 'cs.LG']" +PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness,Siyao Jiang · Huisi Wu · Junyang Chen · Qin Zhang · Jing Qin, ,,https://link.springer.com/article/10.1007/s11517-023-02970-4,,,,,nan +Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge,Haoxiang Ma · Modi Shi · Boyang GAO · Di Huang, ,https://arxiv.org/abs/2404.01727v1,,2404.01727v1.pdf,Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge,"We focus on the generalization ability of the 6-DoF grasp detection method in +this paper. While learning-based grasp detection methods can predict grasp +poses for unseen objects using the grasp distribution learned from the training +set, they often exhibit a significant performance drop when encountering +objects with diverse shapes and structures. To enhance the grasp detection +methods' generalization ability, we incorporate domain prior knowledge of +robotic grasping, enabling better adaptation to objects with significant shape +and structure differences. More specifically, we employ the physical constraint +regularization during the training phase to guide the model towards predicting +grasps that comply with the physical rule on grasping. For the unstable grasp +poses predicted on novel objects, we design a contact-score joint optimization +using the projection contact map to refine these poses in cluttered scenarios. +Extensive experiments conducted on the GraspNet-1billion benchmark demonstrate +a substantial performance gain on the novel object set and the real-world +grasping experiments also demonstrate the effectiveness of our generalizing +6-DoF grasp detection method.",cs.RO,"['cs.RO', 'cs.CV']" +Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline,Xiao Wang · Shiao Wang · Chuanming Tang · Lin Zhu · Bo Jiang · Yonghong Tian · Jin Tang, ,https://arxiv.org/abs/2309.14611,,2309.14611.pdf,Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline,"Tracking using bio-inspired event cameras has drawn more and more attention +in recent years. Existing works either utilize aligned RGB and event data for +accurate tracking or directly learn an event-based tracker. The first category +needs more cost for inference and the second one may be easily influenced by +noisy events or sparse spatial resolution. In this paper, we propose a novel +hierarchical knowledge distillation framework that can fully utilize +multi-modal / multi-view information during training to facilitate knowledge +transfer, enabling us to achieve high-speed and low-latency visual tracking +during testing by using only event signals. Specifically, a teacher +Transformer-based multi-modal tracking framework is first trained by feeding +the RGB frame and event stream simultaneously. Then, we design a new +hierarchical knowledge distillation strategy which includes pairwise +similarity, feature representation, and response maps-based knowledge +distillation to guide the learning of the student Transformer network. +Moreover, since existing event-based tracking datasets are all low-resolution +($346 \times 260$), we propose the first large-scale high-resolution ($1280 +\times 720$) dataset named EventVOT. 
It contains 1141 videos and covers a wide +range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc. +Extensive experiments on both low-resolution (FE240hz, VisEvent, COESOT), and +our newly proposed high-resolution EventVOT dataset fully validated the +effectiveness of our proposed method. The dataset, evaluation toolkit, and +source code are available on +\url{https://github.com/Event-AHU/EventVOT_Benchmark}",cs.CV,"['cs.CV', 'cs.NE']" +An Empirical Study of Scaling Law for Scene Text Recognition,Miao Rang · Zhenni Bi · Chuanjian Liu · Yunhe Wang · Kai Han, ,https://arxiv.org/abs/2401.00028,,2401.00028.pdf,An Empirical Study of Scaling Law for OCR,"The laws of model size, data volume, computation and model performance have +been extensively studied in the field of Natural Language Processing (NLP). +However, the scaling laws in Optical Character Recognition (OCR) have not yet +been investigated. To address this, we conducted comprehensive studies that +involved examining the correlation between performance and the scale of models, +data volume and computation in the field of text recognition.Conclusively, the +study demonstrates smooth power laws between performance and model size, as +well as training data volume, when other influencing factors are held constant. +Additionally, we have constructed a large-scale dataset called REBU-Syn, which +comprises 6 million real samples and 18 million synthetic samples. Based on our +scaling law and new dataset, we have successfully trained a scene text +recognition model, achieving a new state-ofthe-art on 6 common test benchmarks +with a top-1 average accuracy of 97.42%. The models and dataset are publicly +available at https://github.com/large-ocr-model/large-ocr-model.github.io.",cs.CV,['cs.CV'] +Dual-scale Transformer for Large-scale Single-Pixel Imaging,Gang Qu · Ping Wang · Xin Yuan, ,https://arxiv.org/abs/2404.05001,,2404.05001.pdf,Dual-Scale Transformer for Large-Scale Single-Pixel Imaging,"Single-pixel imaging (SPI) is a potential computational imaging technique +which produces image by solving an illposed reconstruction problem from few +measurements captured by a single-pixel detector. Deep learning has achieved +impressive success on SPI reconstruction. However, previous poor reconstruction +performance and impractical imaging model limit its real-world applications. In +this paper, we propose a deep unfolding network with hybrid-attention +Transformer on Kronecker SPI model, dubbed HATNet, to improve the imaging +quality of real SPI cameras. Specifically, we unfold the computation graph of +the iterative shrinkagethresholding algorithm (ISTA) into two alternative +modules: efficient tensor gradient descent and hybrid-attention multiscale +denoising. By virtue of Kronecker SPI, the gradient descent module can avoid +high computational overheads rooted in previous gradient descent modules based +on vectorized SPI. The denoising module is an encoder-decoder architecture +powered by dual-scale spatial attention for high- and low-frequency aggregation +and channel attention for global information recalibration. Moreover, we build +a SPI prototype to verify the effectiveness of the proposed method. Extensive +experiments on synthetic and real data demonstrate that our method achieves the +state-of-the-art performance. 
The source code and pre-trained models are +available at https://github.com/Gang-Qu/HATNet-SPI.",cs.CV,['cs.CV'] +Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching,Rui Gong · Weide Liu · ZAIWANG GU · Xulei Yang · Jun Cheng, ,https://arxiv.org/abs/2402.19270,,2402.19270.pdf,Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching,"Geometric knowledge has been shown to be beneficial for the stereo matching +task. However, prior attempts to integrate geometric insights into stereo +matching algorithms have largely focused on geometric knowledge from single +images while crucial cross-view factors such as occlusion and matching +uniqueness have been overlooked. To address this gap, we propose a novel +Intra-view and Cross-view Geometric knowledge learning Network (ICGNet), +specifically crafted to assimilate both intra-view and cross-view geometric +knowledge. ICGNet harnesses the power of interest points to serve as a channel +for intra-view geometric understanding. Simultaneously, it employs the +correspondences among these points to capture cross-view geometric +relationships. This dual incorporation empowers the proposed ICGNet to leverage +both intra-view and cross-view geometric knowledge in its learning process, +substantially improving its ability to estimate disparities. Our extensive +experiments demonstrate the superiority of the ICGNet over contemporary leading +models.",cs.CV,['cs.CV'] +ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks,Kai Han · Yunhe Wang · Jianyuan Guo · Enhua Wu,https://parameternet.github.io/,https://arxiv.org/abs/2306.14525,,2306.14525.pdf,ParameterNet: Parameters Are All You Need,"The large-scale visual pretraining has significantly improve the performance +of large vision models. However, we observe the \emph{low FLOPs pitfall} that +the existing low-FLOPs models cannot benefit from large-scale pretraining. In +this paper, we introduce a novel design principle, termed ParameterNet, aimed +at augmenting the number of parameters in large-scale visual pretraining models +while minimizing the increase in FLOPs. We leverage dynamic convolutions to +incorporate additional parameters into the networks with only a marginal rise +in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage +of large-scale visual pretraining. Furthermore, we extend the ParameterNet +concept to the language domain to enhance inference results while preserving +inference speed. Experiments on the large-scale ImageNet-22K have shown the +superiority of our ParameterNet scheme. For example, ParameterNet-600M can +achieve higher accuracy on ImageNet than the widely-used Swin Transformer +(81.6\% \emph{vs.} 80.9\%) and has much lower FLOPs (0.6G \emph{vs.} 4.5G). In +the language domain, LLaMA-1B enhanced with ParameterNet achieves 2\% higher +accuracy over vanilla LLaMA. 
The code will be released at +\url{https://parameternet.github.io/}.",cs.CV,['cs.CV'] +Relational Matching for Weakly Semi-Supervised Oriented Object Detection,Wenhao Wu · Hau San Wong · Si Wu · Tianyou Zhang, ,,https://paperswithcode.com/paper/weakly-semi-supervised-object-detection-in,,,,,nan +A2XP: Towards Private Domain Generalization,Geunhyeok Yu · Hyoseok Hwang,https://airlabkhu.github.io/A2XP/,https://arxiv.org/abs/2311.10339,,2311.10339.pdf,A2XP: Towards Private Domain Generalization,"Deep Neural Networks (DNNs) have become pivotal in various fields, especially +in computer vision, outperforming previous methodologies. A critical challenge +in their deployment is the bias inherent in data across different domains, such +as image style and environmental conditions, leading to domain gaps. This +necessitates techniques for learning general representations from biased +training data, known as domain generalization. This paper presents Attend to +eXpert Prompts (A2XP), a novel approach for domain generalization that +preserves the privacy and integrity of the network architecture. A2XP consists +of two phases: Expert Adaptation and Domain Generalization. In the first phase, +prompts for each source domain are optimized to guide the model towards the +optimal direction. In the second phase, two embedder networks are trained to +effectively amalgamate these expert prompts, aiming for an optimal output. Our +extensive experiments demonstrate that A2XP achieves state-of-the-art results +over existing non-private domain generalization methods. The experimental +results validate that the proposed approach not only tackles the domain +generalization challenge in DNNs but also offers a privacy-preserving, +efficient solution to the broader field of computer vision.",cs.CV,['cs.CV'] +Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding,Hoang-Quan Nguyen · Thanh-Dat Truong · Xuan-Bac Nguyen · Ashley Dowling · Xin Li · Khoa Luu,https://uark-cviu.github.io/projects/insect_foundation.html,https://arxiv.org/abs/2311.15206,,2311.15206.pdf,Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding,"In precision agriculture, the detection and recognition of insects play an +essential role in the ability of crops to grow healthy and produce a +high-quality yield. The current machine vision model requires a large volume of +data to achieve high performance. However, there are approximately 5.5 million +different insect species in the world. None of the existing insect datasets can +cover even a fraction of them due to varying geographic locations and +acquisition costs. In this paper, we introduce a novel ""Insect-1M"" dataset, a +game-changing resource poised to revolutionize insect-related foundation model +training. Covering a vast spectrum of insect species, our dataset, including 1 +million images with dense identification labels of taxonomy hierarchy and +insect descriptions, offers a panoramic view of entomology, enabling foundation +models to comprehend visual and semantic information about insects like never +before. Then, to efficiently establish an Insect Foundation Model, we develop a +micro-feature self-supervised learning method with a Patch-wise Relevant +Attention mechanism capable of discerning the subtle differences among insect +images. In addition, we introduce Description Consistency loss to improve +micro-feature modeling via insect descriptions. 
Through our experiments, we +illustrate the effectiveness of our proposed approach in insect modeling and +achieve State-of-the-Art performance on standard benchmarks of insect-related +tasks. Our Insect Foundation Model and Dataset promise to empower the next +generation of insect-related vision models, bringing them closer to the +ultimate goal of precision agriculture.",cs.CV,['cs.CV'] +PostureHMR: Posture Transformation for 3D Human Mesh Recovery,Yu-Pei Song · Xiao WU · Zhaoquan Yuan · Jian-Jun Qiao · Qiang Peng, ,https://arxiv.org/abs/2403.12473,,2403.12473.pdf,PostoMETRO: Pose Token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery,"With the recent advancements in single-image-based human mesh recovery, there +is a growing interest in enhancing its performance in certain extreme +scenarios, such as occlusion, while maintaining overall model accuracy. +Although obtaining accurately annotated 3D human poses under occlusion is +challenging, there is still a wealth of rich and precise 2D pose annotations +that can be leveraged. However, existing works mostly focus on directly +leveraging 2D pose coordinates to estimate 3D pose and mesh. In this paper, we +present PostoMETRO($\textbf{Pos}$e $\textbf{to}$ken enhanced $\textbf{ME}$sh +$\textbf{TR}$ansf$\textbf{O}$rmer), which integrates occlusion-resilient 2D +pose representation into transformers in a token-wise manner. Utilizing a +specialized pose tokenizer, we efficiently condense 2D pose data to a compact +sequence of pose tokens and feed them to the transformer together with the +image tokens. This process not only ensures a rich depiction of texture from +the image but also fosters a robust integration of pose and image information. +Subsequently, these combined tokens are queried by vertex and joint tokens to +decode 3D coordinates of mesh vertices and human joints. Facilitated by the +robust pose token representation and the effective combination, we are able to +produce more precise 3D coordinates, even under extreme scenarios like +occlusion. Experiments on both standard and occlusion-specific benchmarks +demonstrate the effectiveness of PostoMETRO. Qualitative results further +illustrate the clarity of how 2D pose can help 3D reconstruction. Code will be +made available.",cs.CV,['cs.CV'] +InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning,Jing Shi · Wei Xiong · Zhe Lin · HyunJoon Jung, ,https://arxiv.org/html/2403.11284v1,,2403.11284v1.pdf,Fast Personalized Text-to-Image Syntheses With Attention Injection,"Currently, personalized image generation methods mostly require considerable +time to finetune and often overfit the concept resulting in generated images +that are similar to custom concepts but difficult to edit by prompts. We +propose an effective and fast approach that could balance the text-image +consistency and identity consistency of the generated image and reference +image. Our method can generate personalized images without any fine-tuning +while maintaining the inherent text-to-image generation ability of diffusion +models. Given a prompt and a reference image, we merge the custom concept into +generated images by manipulating cross-attention and self-attention layers of +the original diffusion model to generate personalized images that match the +text description. 
Comprehensive experiments highlight the superiority of our +method.",cs.CV,['cs.CV'] +Exact Fusion via Feature Distribution Matching for Few-shot Image Generation,Yingbo Zhou · Yutong Ye · Pengyu Zhang · Xian Wei · Mingsong Chen, ,https://arxiv.org/abs/2307.14638v1,,2307.14638v1.pdf,EqGAN: Feature Equalization Fusion for Few-shot Image Generation,"Due to the absence of fine structure and texture information, existing +fusion-based few-shot image generation methods suffer from unsatisfactory +generation quality and diversity. To address this problem, we propose a novel +feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot +image generation. Unlike existing fusion strategies that rely on either deep +features or local representations, we design two separate branches to fuse +structures and textures by disentangling encoded features into shallow and deep +contents. To refine image contents at all feature levels, we equalize the fused +structure and texture semantics at different scales and supplement the decoder +with richer information by skip connections. Since the fused structures and +textures may be inconsistent with each other, we devise a consistent +equalization loss between the equalized features and the intermediate output of +the decoder to further align the semantics. Comprehensive experiments on three +public datasets demonstrate that, EqGAN not only significantly improves +generation performance with FID score (by up to 32.7%) and LPIPS score (by up +to 4.19%), but also outperforms the state-of-the-arts in terms of accuracy (by +up to 1.97%) for downstream classification tasks.",cs.CV,['cs.CV'] +Data Poisoning based Backdoor Attacks to Contrastive Learning,Jinghuai Zhang · Hongbin Liu · Jinyuan Jia · Neil Zhenqiang Gong,https://github.com/jzhang538/CorruptEncoder,,,,,,,nan +GenesisTex: Adapting Image Denoising Diffusion to Texture Space,Chenjian Gao · Boyan Jiang · Xinghui Li · YingPeng Zhang · Qian Yu,https://cjeen.github.io/GenesisTexPaper/,https://arxiv.org/abs/2403.17782,,2403.17782.pdf,GenesisTex: Adapting Image Denoising Diffusion to Texture Space,"We present GenesisTex, a novel method for synthesizing textures for 3D +geometries from text descriptions. GenesisTex adapts the pretrained image +diffusion model to texture space by texture space sampling. Specifically, we +maintain a latent texture map for each viewpoint, which is updated with +predicted noise on the rendering of the corresponding viewpoint. The sampled +latent texture maps are then decoded into a final texture map. During the +sampling process, we focus on both global and local consistency across multiple +viewpoints: global consistency is achieved through the integration of style +consistency mechanisms within the noise prediction network, and low-level +consistency is achieved by dynamically aligning latent textures. Finally, we +apply reference-based inpainting and img2img on denser views for texture +refinement. Our approach overcomes the limitations of slow optimization in +distillation-based methods and instability in inpainting-based methods. 
+Experiments on meshes from various sources demonstrate that our method +surpasses the baseline methods quantitatively and qualitatively.",cs.CV,"['cs.CV', 'cs.GR']" +On Scaling up a Multilingual Vision and Language Model,Xi Chen · Josip Djolonga · Piotr Padlewski · Basil Mustafa · Soravit Changpinyo · Jialin Wu · Carlos Riquelme Ruiz · Sebastian Goodman · Xiao Wang · Yi Tay · Siamak Shakeri · Mostafa Dehghani · Daniel Salz · Mario Lučić · Michael Tschannen · Arsha Nagrani · Hexiang Hu · Mandar Joshi · Bo Pang · Ceslee Montgomery · Paulina Pietrzyk · Marvin Ritter · AJ Piergiovanni · Matthias Minderer · Filip Pavetic · Austin Waters · Gang Li · Ibrahim Alabdulmohsin · Lucas Beyer · Julien Amelot · Kenton Lee · Andreas Steiner · Yang Li · Daniel Keysers · Anurag Arnab · Yuanzhong Xu · Keran Rong · Alexander Kolesnikov · Mojtaba Seyedhosseini · Anelia Angelova · Xiaohua Zhai · Neil Houlsby · Radu Soricut, ,https://ar5iv.labs.arxiv.org/html/2312.07533,,2312.07533.pdf,VILA: On Pre-training for Visual Language Models,"Visual language models (VLMs) rapidly progressed with the recent success of +large language models. There have been growing efforts on visual instruction +tuning to extend the LLM with visual inputs, but lacks an in-depth study of the +visual language pre-training process, where the model learns to perform joint +modeling on both modalities. In this work, we examine the design options for +VLM pre-training by augmenting LLM towards VLM through step-by-step +controllable comparisons. We introduce three main findings: (1) freezing LLMs +during pre-training can achieve decent zero-shot performance, but lack +in-context learning capability, which requires unfreezing the LLM; (2) +interleaved pre-training data is beneficial whereas image-text pairs alone are +not optimal; (3) re-blending text-only instruction data to image-text data +during instruction fine-tuning not only remedies the degradation of text-only +tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe +we build VILA, a Visual Language model family that consistently outperforms the +state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells +and whistles. Multi-modal pre-training also helps unveil appealing properties +of VILA, including multi-image reasoning, enhanced in-context learning, and +better world knowledge.",cs.CV,['cs.CV'] +$V_kD:$ Improving knowledge distillation using orthogonal projections,Roy Miles · Ismail Elezi · Jiankang Deng, ,https://arxiv.org/abs/2403.06213,,2403.06213.pdf,$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections,"Knowledge distillation is an effective method for training small and +efficient deep learning models. However, the efficacy of a single method can +degenerate when transferring to other tasks, modalities, or even other +architectures. To address this limitation, we propose a novel constrained +feature distillation method. This method is derived from a small set of core +principles, which results in two emerging components: an orthogonal projection +and a task-specific normalisation. Equipped with both of these components, our +transformer models can outperform all previous methods on ImageNet and reach up +to a 4.4% relative improvement over the previous state-of-the-art methods. To +further demonstrate the generality of our method, we apply it to object +detection and image generation, whereby we obtain consistent and substantial +performance improvements over state-of-the-art. 
Code and models are publicly +available: https://github.com/roymiles/vkd",cs.CV,"['cs.CV', 'cs.AI']" +Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods,Chenfan Qu · Yiwu Zhong · Chongyu Liu · Guitao Xu · Dezhi Peng · Fengjun Guo · Lianwen Jin, ,https://arxiv.org/abs/2309.01858,,2309.01858.pdf,Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations,"Fine-grained and instance-level recognition methods are commonly trained and +evaluated on specific domains, in a model per domain scenario. Such an +approach, however, is impractical in real large-scale applications. In this +work, we address the problem of universal image embedding, where a single +universal model is trained and used in multiple domains. First, we leverage +existing domain-specific datasets to carefully construct a new large-scale +public benchmark for the evaluation of universal image embeddings, with 241k +query images, 1.4M index images and 2.8M training images across 8 different +domains and 349k classes. We define suitable metrics, training and evaluation +protocols to foster future research in this area. Second, we provide a +comprehensive experimental evaluation on the new dataset, demonstrating that +existing approaches and simplistic extensions lead to worse performance than an +assembly of models trained for each domain separately. Finally, we conducted a +public research competition on this topic, leveraging industrial datasets, +which attracted the participation of more than 1k teams worldwide. This +exercise generated many interesting research ideas and findings which we +present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/",cs.CV,['cs.CV'] +Permutation Equivariance of Transformers and Its Applications,Hengyuan Xu · Liyao Xiang · Hangyu Ye · Dixi Yao · Pengzhi Chu · Baochun Li,https://github.com/Doby-Xu/ST,https://arxiv.org/abs/2403.05842,,2403.05842.pdf,Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance,"With the blossom of deep learning models and services, it has become an +imperative concern to safeguard the valuable model parameters from being +stolen. Watermarking is considered an important tool for ownership +verification. However, current watermarking schemes are customized for +different models and tasks, hard to be integrated as an integrated intellectual +protection service. We propose Hufu, a modality-agnostic watermarking system +for pre-trained Transformer-based models, relying on the permutation +equivariance property of Transformers. Hufu embeds watermark by fine-tuning the +pre-trained model on a set of data samples specifically permuted, and the +embedded model essentially contains two sets of weights -- one for normal use +and the other for watermark extraction which is triggered on permuted inputs. +The permutation equivariance ensures minimal interference between these two +sets of model weights and thus high fidelity on downstream tasks. Since our +method only depends on the model itself, it is naturally modality-agnostic, +task-independent, and trigger-sample-free. 
Extensive experiments on the +state-of-the-art vision Transformers, BERT, and GPT2 have demonstrated Hufu's +superiority in meeting watermarking requirements including effectiveness, +efficiency, fidelity, and robustness, showing its great potential to be +deployed as a uniform ownership verification service for various Transformers.",cs.CR,"['cs.CR', 'cs.AI']" +CLOAF: CoLlisiOn-Aware Human Flow,Andrey Davydov · Martin Engilberge · Mathieu Salzmann · Pascal Fua,https://arxiv.org/abs/2403.09050,https://arxiv.org/abs/2403.09050,,2403.09050.pdf,CLOAF: CoLlisiOn-Aware Human Flow,"Even the best current algorithms for estimating body 3D shape and pose yield +results that include body self-intersections. In this paper, we present CLOAF, +which exploits the diffeomorphic nature of Ordinary Differential Equations to +eliminate such self-intersections while still imposing body shape constraints. +We show that, unlike earlier approaches to addressing this issue, ours +completely eliminates the self-intersections without compromising the accuracy +of the reconstructions. Being differentiable, CLOAF can be used to fine-tune +pose and shape estimation baselines to improve their overall performance and +eliminate self-intersections in their predictions. Furthermore, we demonstrate +how our CLOAF strategy can be applied to practically any motion field induced +by the user. CLOAF also makes it possible to edit motion to interact with the +environment without worrying about potential collision or loss of body-shape +prior.",cs.CV,['cs.CV'] +A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction,Jin Gong · Runzhao Yang · Weihang Zhang · Jinli Suo · Qionghai Dai, ,https://arxiv.org/abs/2310.09528,,2310.09528.pdf,Hypernetwork-based Meta-Learning for Low-Rank Physics-Informed Neural Networks,"In various engineering and applied science applications, repetitive numerical +simulations of partial differential equations (PDEs) for varying input +parameters are often required (e.g., aircraft shape optimization over many +design parameters) and solvers are required to perform rapid execution. In this +study, we suggest a path that potentially opens up a possibility for +physics-informed neural networks (PINNs), emerging deep-learning-based solvers, +to be considered as one such solver. Although PINNs have pioneered a proper +integration of deep-learning and scientific computing, they require repetitive +time-consuming training of neural networks, which is not suitable for +many-query scenarios. To address this issue, we propose a lightweight low-rank +PINNs containing only hundreds of model parameters and an associated +hypernetwork-based meta-learning algorithm, which allows efficient +approximation of solutions of PDEs for varying ranges of PDE input parameters. +Moreover, we show that the proposed method is effective in overcoming a +challenging issue, known as ""failure modes"" of PINNs.",cs.LG,"['cs.LG', 'cs.NA', 'math.NA', 'physics.comp-ph']" +Pre-training Vision Models with Mandelbulb Variations,Benjamin N. Chiche · Yuto Horikawa · Ryo Fujita, ,https://arxiv.org/abs/2403.03346,,2403.03346.pdf,Enhancing Vision-Language Pre-training with Rich Supervisions,"We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel +pre-training paradigm for Vision-Language Models using data from large-scale +web screenshot rendering. Using web screenshots unlocks a treasure trove of +visual and textual cues that are not present in using image-text pairs. 
In S4, +we leverage the inherent tree-structured hierarchy of HTML elements and the +spatial localization to carefully design 10 pre-training tasks with large scale +annotated data. These tasks resemble downstream tasks across different domains +and the annotations are cheap to obtain. We demonstrate that, compared to +current screenshot pre-training objectives, our innovative pre-training method +significantly enhances performance of image-to-text model in nine varied and +popular downstream tasks - up to 76.1% improvements on Table Detection, and at +least 1% on Widget Captioning.",cs.CV,['cs.CV'] +Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following,Yutong Feng · Biao Gong · Di Chen · Yujun Shen · Yu Liu · Jingren Zhou, ,https://arxiv.org/abs/2311.17002,,2311.17002.pdf,Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following,"Existing text-to-image (T2I) diffusion models usually struggle in +interpreting complex prompts, especially those with quantity, object-attribute +binding, and multi-subject descriptions. In this work, we introduce a semantic +panel as the middleware in decoding texts to images, supporting the generator +to better follow instructions. The panel is obtained through arranging the +visual concepts parsed from the input text by the aid of large language models, +and then injected into the denoising network as a detailed control signal to +complement the text condition. To facilitate text-to-panel learning, we come up +with a carefully designed semantic formatting protocol, accompanied by a +fully-automatic data preparation pipeline. Thanks to such a design, our +approach, which we call Ranni, manages to enhance a pre-trained T2I generator +regarding its textual controllability. More importantly, the introduction of +the generative middleware brings a more convenient form of interaction (i.e., +directly adjusting the elements in the panel or using language instructions) +and further allows users to finely customize their generation, based on which +we develop a practical system and showcase its potential in continuous +generation and chatting-based editing. Our project page is at +https://ranni-t2i.github.io/Ranni.",cs.CV,['cs.CV'] +Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay,Yuhang Zhou · Zhongyun Hua, ,https://arxiv.org/abs/2404.01828,,2404.01828.pdf,Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay,"Deep neural networks have demonstrated susceptibility to adversarial attacks. +Adversarial defense techniques often focus on one-shot setting to maintain +robustness against attack. However, new attacks can emerge in sequences in +real-world deployment scenarios. As a result, it is crucial for a defense model +to constantly adapt to new attacks, but the adaptation process can lead to +catastrophic forgetting of previously defended against attacks. In this paper, +we discuss for the first time the concept of continual adversarial defense +under a sequence of attacks, and propose a lifelong defense baseline called +Anisotropic \& Isotropic Replay (AIR), which offers three advantages: (1) +Isotropic replay ensures model consistency in the neighborhood distribution of +new data, indirectly aligning the output preference between old and new tasks. +(2) Anisotropic replay enables the model to learn a compromise data manifold +with fresh mixed semantics for further replay constraints and potential future +attacks. 
(3) A straightforward regularizer mitigates the 'plasticity-stability' +trade-off by aligning model output between new and old tasks. Experiment +results demonstrate that AIR can approximate or even exceed the empirical +performance upper bounds achieved by Joint Training.",cs.LG,"['cs.LG', 'cs.AI']" +CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution,Qingguo Liu · Chenyi Zhuang · Pan Gao · Jie Qin, ,https://arxiv.org/abs/2405.07648,,2405.07648.pdf,CDFormer:When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution,"Existing Blind image Super-Resolution (BSR) methods focus on estimating +either kernel or degradation information, but have long overlooked the +essential content details. In this paper, we propose a novel BSR approach, +Content-aware Degradation-driven Transformer (CDFormer), to capture both +degradation and content representations. However, low-resolution images cannot +provide enough content details, and thus we introduce a diffusion-based module +$CDFormer_{diff}$ to first learn Content Degradation Prior (CDP) in both low- +and high-resolution images, and then approximate the real distribution given +only low-resolution information. Moreover, we apply an adaptive SR network +$CDFormer_{SR}$ that effectively utilizes CDP to refine features. Compared to +previous diffusion-based SR methods, we treat the diffusion model as an +estimator that can overcome the limitations of expensive sampling time and +excessive diversity. Experiments show that CDFormer can outperform existing +methods, establishing a new state-of-the-art performance on various benchmarks +under blind settings. Codes and models will be available at +\href{https://github.com/I2-Multimedia-Lab/CDFormer}{https://github.com/I2-Multimedia-Lab/CDFormer}.",cs.CV,"['cs.CV', 'eess.IV']" +Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration,Tony C. W. MOK · Zi Li · Yunhao Bai · Jianpeng Zhang · Wei Liu · Yan-Jie Zhou · Ke Yan · Dakai Jin · Yu Shi · Xiaoli Yin · Le Lu · Ling Zhang, ,https://arxiv.org/abs/2402.18933,,2402.18933.pdf,Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration,"Establishing dense anatomical correspondence across distinct imaging +modalities is a foundational yet challenging procedure for numerous medical +image analysis studies and image-guided radiotherapy. Existing multi-modality +image registration algorithms rely on statistical-based similarity measures or +local structural image representations. However, the former is sensitive to +locally varying noise, while the latter is not discriminative enough to cope +with complex anatomical structures in multimodal scans, causing ambiguity in +determining the anatomical correspondence across scans with different +modalities. In this paper, we propose a modality-agnostic structural +representation learning method, which leverages Deep Neighbourhood +Self-similarity (DNS) and anatomy-aware contrastive learning to learn +discriminative and contrast-invariance deep structural image representations +(DSIR) without the need for anatomical delineations or pre-aligned training +images. We evaluate our method on multiphase CT, abdomen MR-CT, and brain MR +T1w-T2w registration. 
Comprehensive results demonstrate that our method is +superior to the conventional local structural representation and +statistical-based similarity measures in terms of discriminability and +accuracy.",cs.CV,['cs.CV'] +Towards Accurate and Robust Architectures via Neural Architecture Search,Yuwei Ou · Yuqi Feng · Yanan Sun, ,https://arxiv.org/abs/2405.05502,,2405.05502.pdf,Towards Accurate and Robust Architectures via Neural Architecture Search,"To defend deep neural networks from adversarial attacks, adversarial training +has been drawing increasing attention for its effectiveness. However, the +accuracy and robustness resulting from the adversarial training are limited by +the architecture, because adversarial training improves accuracy and robustness +by adjusting the weight connection affiliated to the architecture. In this +work, we propose ARNAS to search for accurate and robust architectures for +adversarial training. First we design an accurate and robust search space, in +which the placement of the cells and the proportional relationship of the +filter numbers are carefully determined. With the design, the architectures can +obtain both accuracy and robustness by deploying accurate and robust structures +to their sensitive positions, respectively. Then we propose a differentiable +multi-objective search strategy, performing gradient descent towards directions +that are beneficial for both natural loss and adversarial loss, thus the +accuracy and robustness can be guaranteed at the same time. We conduct +comprehensive experiments in terms of white-box attacks, black-box attacks, and +transferability. Experimental results show that the searched architecture has +the strongest robustness with the competitive accuracy, and breaks the +traditional idea that NAS-based architectures cannot transfer well to complex +tasks in robustness scenarios. By analyzing outstanding architectures searched, +we also conclude that accurate and robust neural architectures tend to deploy +different structures near the input and output, which has great practical +significance on both hand-crafting and automatically designing of accurate and +robust architectures.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Fast Adaptation for Human Pose Estimation via Meta-Optimization,Shengxiang Hu · Huaijiang Sun · Bin Li · Dong Wei · Weiqing Li · Jianfeng Lu, ,https://arxiv.org/abs/2405.05216,,2405.05216.pdf,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. 
(2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation,Ruining Deng · Quan Liu · Can Cui · Tianyuan Yao · Jialin Yue · Juming Xiong · Lining yu · Yifei Wu · Mengmeng Yin · Yu Wang · Shilin Zhao · Yucheng Tang · Haichun Yang · Yuankai Huo, ,https://arxiv.org/abs/2402.19286,,2402.19286.pdf,PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation,"Understanding the anatomy of renal pathology is crucial for advancing disease +diagnostics, treatment evaluation, and clinical research. The complex kidney +system comprises various components across multiple levels, including regions +(cortex, medulla), functional units (glomeruli, tubules), and cells (podocytes, +mesangial cells in glomerulus). Prior studies have predominantly overlooked the +intricate spatial interrelations among objects from clinical knowledge. In this +research, we introduce a novel universal proposition learning approach, called +panoramic renal pathology segmentation (PrPSeg), designed to segment +comprehensively panoramic structures within kidney by integrating extensive +knowledge of kidney anatomy. + In this paper, we propose (1) the design of a comprehensive universal +proposition matrix for renal pathology, facilitating the incorporation of +classification and spatial relationships into the segmentation process; (2) a +token-based dynamic head single network architecture, with the improvement of +the partial label image segmentation and capability for future data +enlargement; and (3) an anatomy loss function, quantifying the inter-object +relationships across the kidney.",eess.IV,"['eess.IV', 'cs.CV']" +Analyzing and Improving the Training Dynamics of Diffusion Models,Tero Karras · Miika Aittala · Jaakko Lehtinen · Janne Hellsten · Timo Aila · Samuli Laine, ,https://arxiv.org/abs/2312.02696,,2312.02696.pdf,Analyzing and Improving the Training Dynamics of Diffusion Models,"Diffusion models currently dominate the field of data-driven image synthesis +with their unparalleled scaling to large datasets. In this paper, we identify +and rectify several causes for uneven and ineffective training in the popular +ADM diffusion model architecture, without altering its high-level structure. +Observing uncontrolled magnitude changes and imbalances in both the network +activations and weights over the course of training, we redesign the network +layers to preserve activation, weight, and update magnitudes on expectation. We +find that systematic application of this philosophy eliminates the observed +drifts and imbalances, resulting in considerably better networks at equal +computational complexity. 
Our modifications improve the previous record FID of +2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic +sampling. + As an independent contribution, we present a method for setting the +exponential moving average (EMA) parameters post-hoc, i.e., after completing +the training run. This allows precise tuning of EMA length without the cost of +performing several training runs, and reveals its surprising interactions with +network architecture, training time, and guidance.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']" +POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning,Jiayi Guan · Li Shen · Ao Zhou · Lusong Li · Han Hu · Xiaodong He · Guang Chen · Changjun Jiang, ,https://arxiv.org/abs/2401.14758,,2401.14758.pdf,Off-Policy Primal-Dual Safe Reinforcement Learning,"Primal-dual safe RL methods commonly perform iterations between the primal +update of the policy and the dual update of the Lagrange Multiplier. Such a +training paradigm is highly susceptible to the error in cumulative cost +estimation since this estimation serves as the key bond connecting the primal +and dual update processes. We show that this problem causes significant +underestimation of cost when using off-policy methods, leading to the failure +to satisfy the safety constraint. To address this issue, we propose +conservative policy optimization, which learns a policy in a +constraint-satisfying area by considering the uncertainty in cost estimation. +This improves constraint satisfaction but also potentially hinders reward +maximization. We then introduce local policy convexification to help eliminate +such suboptimality by gradually reducing the estimation uncertainty. We provide +theoretical interpretations of the joint coupling effect of these two +ingredients and further verify them by extensive experiments. Results on +benchmark tasks show that our method not only achieves an asymptotic +performance comparable to state-of-the-art on-policy methods while using much +fewer samples, but also significantly reduces constraint violation during +training. Our code is available at https://github.com/ZifanWu/CAL.",cs.LG,['cs.LG'] +CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing,Guiwei Zhang · Tianyu Zhang · Guanglin Niu · Zichang Tan · Zichang Tan · Yalong Bai · Qing Yang, ,,https://openreview.net/forum?id=5a79AqFr0c,,,,,nan +VicTR: Video-conditioned Text Representations for Activity Recognition,Kumara Kahatapitiya · Anurag Arnab · Arsha Nagrani · Michael Ryoo, ,https://ar5iv.labs.arxiv.org/html/2309.00696,,2309.00696.pdf,AAN: Attributes-Aware Network for Temporal Action Detection,"The challenge of long-term video understanding remains constrained by the +efficient extraction of object semantics and the modelling of their +relationships for downstream tasks. Although the CLIP visual features exhibit +discriminative properties for various vision tasks, particularly in object +encoding, they are suboptimal for long-term video understanding. To address +this issue, we present the Attributes-Aware Network (AAN), which consists of +two key components: the Attributes Extractor and a Graph Reasoning block. These +components facilitate the extraction of object-centric attributes and the +modelling of their relationships within the video. 
By leveraging CLIP features, +AAN outperforms state-of-the-art approaches on two popular action detection +datasets: Charades and Toyota Smarthome Untrimmed datasets.",cs.CV,['cs.CV'] +Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain,Qunliang Xing · Mai Xu · Shengxi Li · Xin Deng · Meisong Zheng · huaida liu · Ying Chen, ,https://arxiv.org/abs/2402.17200,,2402.17200.pdf,Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain,"Existing quality enhancement methods for compressed images focus on aligning +the enhancement domain with the raw domain to yield realistic images. However, +these methods exhibit a pervasive enhancement bias towards the compression +domain, inadvertently regarding it as more realistic than the raw domain. This +bias makes enhanced images closely resemble their compressed counterparts, thus +degrading their perceptual quality. In this paper, we propose a simple yet +effective method to mitigate this bias and enhance the quality of compressed +images. Our method employs a conditional discriminator with the compressed +image as a key condition, and then incorporates a domain-divergence +regularization to actively distance the enhancement domain from the compression +domain. Through this dual strategy, our method enables the discrimination +against the compression domain, and brings the enhancement domain closer to the +raw domain. Comprehensive quality evaluations confirm the superiority of our +method over other state-of-the-art methods without incurring inference +overheads.",cs.CV,"['cs.CV', 'eess.IV']" +SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction,Conghao Wong · Beihao Xia · Ziqian Zou · Yulong Wang · Xinge You,https://cocoon2wong.github.io/SocialCircle,https://arxiv.org/abs/2310.05370,,2310.05370.pdf,SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction,"Analyzing and forecasting trajectories of agents like pedestrians and cars in +complex scenes has become more and more significant in many intelligent systems +and applications. The diversity and uncertainty in socially interactive +behaviors among a rich variety of agents make this task more challenging than +other deterministic computer vision tasks. Researchers have made a lot of +efforts to quantify the effects of these interactions on future trajectories +through different mathematical models and network structures, but this problem +has not been well solved. Inspired by marine animals that localize the +positions of their companions underwater through echoes, we build a new +anglebased trainable social interaction representation, named SocialCircle, for +continuously reflecting the context of social interactions at different angular +orientations relative to the target agent. 
We validate the effect of the +proposed SocialCircle by training it along with several newly released +trajectory prediction models, and experiments show that the SocialCircle not +only quantitatively improves the prediction performance, but also qualitatively +helps better simulate social interactions when forecasting pedestrian +trajectories in a way that is consistent with human intuitions.",cs.CV,['cs.CV'] +Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-training via Differentiable Rendering of Line Segments,Yusuke Takimoto · Hikari Takehara · Hiroyuki Sato · Zihao Zhu · Bo Zheng,https://dr-hair.github.io/Dr-Hair/,https://arxiv.org/abs/2403.17496,,2403.17496.pdf,Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-training via Differentiable Rendering of Line Segments,"In the film and gaming industries, achieving a realistic hair appearance +typically involves the use of strands originating from the scalp. However, +reconstructing these strands from observed surface images of hair presents +significant challenges. The difficulty in acquiring Ground Truth (GT) data has +led state-of-the-art learning-based methods to rely on pre-training with +manually prepared synthetic CG data. This process is not only labor-intensive +and costly but also introduces complications due to the domain gap when +compared to real-world data. In this study, we propose an optimization-based +approach that eliminates the need for pre-training. Our method represents hair +strands as line segments growing from the scalp and optimizes them using a +novel differentiable rendering algorithm. To robustly optimize a substantial +number of slender explicit geometries, we introduce 3D orientation estimation +utilizing global optimization, strand initialization based on Laplace's +equation, and reparameterization that leverages geometric connectivity and +spatial proximity. Unlike existing optimization-based methods, our method is +capable of reconstructing internal hair flow in an absolute direction. Our +method exhibits robust and accurate inverse rendering, surpassing the quality +of existing methods and significantly improving processing speed.",cs.CV,"['cs.CV', 'cs.GR']" +SnAG: Scalable and Accurate Video Grounding,Fangzhou Mu · Sicheng Mo · Yin Li, ,https://arxiv.org/abs/2404.02257,,2404.02257.pdf,SnAG: Scalable and Accurate Video Grounding,"Temporal grounding of text descriptions in videos is a central problem in +vision-language learning and video understanding. Existing methods often +prioritize accuracy over scalability -- they have been optimized for grounding +only a few text queries within short videos, and fail to scale up to long +videos with hundreds of queries. In this paper, we study the effect of +cross-modal fusion on the scalability of video grounding models. Our analysis +establishes late fusion as a more cost-effective fusion scheme for long-form +videos with many text queries. Moreover, it leads us to a novel, video-centric +sampling scheme for efficient training. Based on these findings, we present +SnAG, a simple baseline for scalable and accurate video grounding. 
Without +bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a +state of the art for long-form video grounding on the challenging MAD dataset, +while achieving highly competitive results on short videos.",cs.CV,['cs.CV'] +Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds,Tianrui Lou · Xiaojun Jia · Jindong Gu · Li Liu · Siyuan Liang · Bangyan He · Xiaochun Cao, ,https://arxiv.org/abs/2403.05247,,2403.05247.pdf,Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds,"Adversarial attack methods based on point manipulation for 3D point cloud +classification have revealed the fragility of 3D models, yet the adversarial +examples they produce are easily perceived or defended against. The trade-off +between the imperceptibility and adversarial strength leads most point attack +methods to inevitably introduce easily detectable outlier points upon a +successful attack. Another promising strategy, shape-based attack, can +effectively eliminate outliers, but existing methods often suffer significant +reductions in imperceptibility due to irrational deformations. We find that +concealing deformation perturbations in areas insensitive to human eyes can +achieve a better trade-off between imperceptibility and adversarial strength, +specifically in parts of the object surface that are complex and exhibit +drastic curvature changes. Therefore, we propose a novel shape-based +adversarial attack method, HiT-ADV, which initially conducts a two-stage search +for attack regions based on saliency and imperceptibility scores, and then adds +deformation perturbations in each attack region using Gaussian kernel +functions. Additionally, HiT-ADV is extendable to physical attack. We propose +that by employing benign resampling and benign rigid transformations, we can +further enhance physical adversarial strength with little sacrifice to +imperceptibility. Extensive experiments have validated the superiority of our +method in terms of adversarial and imperceptible properties in both digital and +physical spaces. Our code is avaliable at: https://github.com/TRLou/HiT-ADV.",cs.CV,"['cs.CV', 'eess.IV']" +PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization,Zining Chen · Weiqiu Wang · Zhicheng Zhao · Fei Su · Aidong Men · Hongying Meng, ,https://arxiv.org/abs/2404.09011,,2404.09011.pdf,PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization,"Domain Generalization (DG) aims to resolve distribution shifts between source +and target domains, and current DG methods are default to the setting that data +from source and target domains share identical categories. Nevertheless, there +exists unseen classes from target domains in practical scenarios. To address +this issue, Open Set Domain Generalization (OSDG) has emerged and several +methods have been exclusively proposed. However, most existing methods adopt +complex architectures with slight improvement compared with DG methods. +Recently, vision-language models (VLMs) have been introduced in DG following +the fine-tuning paradigm, but consume huge training overhead with large vision +models. Therefore, in this paper, we innovate to transfer knowledge from VLMs +to lightweight vision models and improve the robustness by introducing +Perturbation Distillation (PD) from three perspectives, including Score, Class +and Instance (SCI), named SCI-PD. 
Moreover, previous methods are oriented by +the benchmarks with identical and fixed splits, ignoring the divergence between +source domains. These methods are revealed to suffer from sharp performance +decay with our proposed new benchmark Hybrid Domain Generalization (HDG) and a +novel metric $H^{2}$-CV, which construct various splits to comprehensively +assess the robustness of algorithms. Extensive experiments demonstrate that our +method outperforms state-of-the-art algorithms on multiple datasets, especially +improving the robustness when confronting data scarcity.",cs.CV,"['cs.CV', 'cs.LG']" +Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction,Jinzhi Zheng · Heng Fan · Libo Zhang, ,https://arxiv.org/html/2401.11704v1,,2401.11704v1.pdf,EK-Net:Real-time Scene Text Detection with Expand Kernel Distance,"Recently, scene text detection has received significant attention due to its +wide application. However, accurate detection in complex scenes of multiple +scales, orientations, and curvature remains a challenge. Numerous detection +methods adopt the Vatti clipping (VC) algorithm for multiple-instance training +to address the issue of arbitrary-shaped text. Yet we identify several bias +results from these approaches called the ""shrinked kernel"". Specifically, it +refers to a decrease in accuracy resulting from an output that overly favors +the text kernel. In this paper, we propose a new approach named Expand Kernel +Network (EK-Net) with expand kernel distance to compensate for the previous +deficiency, which includes three-stages regression to complete instance +detection. Moreover, EK-Net not only realize the precise positioning of +arbitrary-shaped text, but also achieve a trade-off between performance and +speed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or +competitive performance compared to other advanced methods, e.g., F-measure of +85.72% at 35.42 FPS on ICDAR 2015, F-measure of 85.75% at 40.13 FPS on CTW1500.",cs.CV,['cs.CV'] +CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion,Xiaoyu Wu · Yang Hua · Chumeng Liang · Jiaru Zhang · Hao Wang · Tao Song · Haibing Guan,https://github.com/Nicholas0228/Revelio,https://arxiv.org/abs/2403.11162,,2403.11162.pdf,CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion,"Diffusion Models (DMs) have evolved into advanced image generation tools, +especially for few-shot generation where a pretrained model is fine-tuned on a +small set of images to capture a specific style or object. Despite their +success, concerns exist about potential copyright violations stemming from the +use of unauthorized data in this process. In response, we present Contrasting +Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring +vivid visual representations for digital copyright authentication. Our approach +involves removing partial information of an image and recovering missing +details by exploiting conceptual differences between the pretrained and +fine-tuned models. We formulate the differences as KL divergence between latent +variables of the two models when given the same input image, which can be +maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). +The similarity between original and recovered images serves as a strong +indicator of potential infringements. 
Extensive experiments on the WikiArt and +Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital +copyright authentication, surpassing alternative validation techniques. Code +implementation is available at https://github.com/Nicholas0228/Revelio.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.CY', 'cs.LG']" +Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration,Yuxi Wei · Zi Wang · Yifan Lu · Chenxin Xu · Changxing Liu · Hao Zhao · Siheng Chen · Yanfeng Wang,https://yifanlu0227.github.io/ChatSim/,https://arxiv.org/abs/2402.05746,,2402.05746.pdf,Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents,"Scene simulation in autonomous driving has gained significant attention +because of its huge potential for generating customized data. However, existing +editable scene simulation approaches face limitations in terms of user +interaction efficiency, multi-camera photo-realistic rendering and external +digital assets integration. To address these challenges, this paper introduces +ChatSim, the first system that enables editable photo-realistic 3D driving +scene simulations via natural language commands with external digital assets. +To enable editing with high command flexibility,~ChatSim leverages a large +language model (LLM) agent collaboration framework. To generate photo-realistic +outcomes, ChatSim employs a novel multi-camera neural radiance field method. +Furthermore, to unleash the potential of extensive high-quality digital assets, +ChatSim employs a novel multi-camera lighting estimation method to achieve +scene-consistent assets' rendering. Our experiments on Waymo Open Dataset +demonstrate that ChatSim can handle complex language commands and generate +corresponding photo-realistic scene videos.",cs.CV,['cs.CV'] +An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning,Jianqing Zhang · Yang Liu · Yang Hua · Jian Cao,https://github.com/TsingZ0/FedKTL,https://arxiv.org/abs/2403.15760,,2403.15760.pdf,An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning,"Heterogeneous Federated Learning (HtFL) enables collaborative learning on +multiple clients with different model architectures while preserving privacy. +Despite recent research progress, knowledge sharing in HtFL is still difficult +due to data and model heterogeneity. To tackle this issue, we leverage the +knowledge stored in pre-trained generators and propose a new upload-efficient +knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). +Our FedKTL can produce client-task-related prototypical image-vector pairs via +the generator's inference on the server. With these pairs, each client can +transfer pre-existing knowledge from the generator to its local model through +an additional supervised local task. We conduct extensive experiments on four +datasets under two types of data heterogeneity with 14 kinds of models +including CNNs and ViTs. Results show that our upload-efficient FedKTL +surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover, +our knowledge transfer scheme is applicable in scenarios with only one edge +client. 
Code: https://github.com/TsingZ0/FedKTL",cs.AI,"['cs.AI', 'cs.DC']" +Language-conditioned Detection Transformer,Jang Hyun Cho · Philipp Krähenbühl,https://janghyuncho.github.io/DECOLA/,,https://www.semanticscholar.org/paper/Language-conditioned-Detection-Transformer-Cho-Krähenbühl/d590b8cabee3630327fa72149a2b137b2c0892f9/figure/0,,,,,nan +Audio-Visual Segmentation via Unlabeled Frame Exploitation,Jinxiang Liu · Yikun Liu · Ferenas · Chen Ju · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2403.11074,,2403.11074.pdf,Audio-Visual Segmentation via Unlabeled Frame Exploitation,"Audio-visual segmentation (AVS) aims to segment the sounding objects in video +frames. Although great progress has been witnessed, we experimentally reveal +that current methods reach marginal performance gain within the use of the +unlabeled frames, leading to the underutilization issue. To fully explore the +potential of the unlabeled frames for AVS, we explicitly divide them into two +categories based on their temporal characteristics, i.e., neighboring frame +(NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, +often contain rich motion information that assists in the accurate localization +of sounding objects. Contrary to NFs, DFs have long temporal distances from the +labeled frame, which share semantic-similar objects with appearance variations. +Considering their unique characteristics, we propose a versatile framework that +effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the +motion cues as the dynamic guidance to improve the objectness localization. +Besides, we exploit the semantic cues in DFs by treating them as valid +augmentations to the labeled frames, which are then used to enrich data +diversity in a self-training manner. Extensive experimental results demonstrate +the versatility and superiority of our method, unleashing the power of the +abundant unlabeled frames.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'cs.SD', 'eess.AS']" +Distilling ODE Solvers of Diffusion Models into Smaller Steps,Sanghwan Kim · Hao Tang · Fisher Yu, ,https://arxiv.org/abs/2309.16421,,2309.16421.pdf,Distilling ODE Solvers of Diffusion Models into Smaller Steps,"Abstract Diffusion models have recently gained prominence as a novel category +of generative models. Despite their success, these models face a notable +drawback in terms of slow sampling speeds, requiring a high number of function +evaluations (NFE) in the order of hundreds or thousands. In response, both +learning-free and learning-based sampling strategies have been explored to +expedite the sampling process. Learning-free sampling employs various ordinary +differential equation (ODE) solvers based on the formulation of diffusion ODEs. +However, it encounters challenges in faithfully tracking the true sampling +trajectory, particularly for small NFE. Conversely, learning-based sampling +methods, such as knowledge distillation, demand extensive additional training, +limiting their practical applicability. To overcome these limitations, we +introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation +approach grounded in ODE solver formulations. Our method seamlessly integrates +the strengths of both learning-free and learning-based sampling. D-ODE solvers +are constructed by introducing a single parameter adjustment to existing ODE +solvers. Furthermore, we optimize D-ODE solvers with smaller steps using +knowledge distillation from ODE solvers with larger steps across a batch of +samples. 
Comprehensive experiments demonstrate the superior performance of +D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM, +DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably, +our method incurs negligible computational overhead compared to previous +distillation techniques, facilitating straightforward and rapid integration +with existing samplers. Qualitative analysis reveals that D-ODE solvers not +only enhance image quality but also faithfully follow the target ODE +trajectory.",cs.CV,['cs.CV'] +Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation,Keonhee Han · Dominik Muhle · Felix Wimbauer · Daniel Cremers,https://keonhee-han.github.io/publications/kdbts/,https://arxiv.org/abs/2404.07933,,2404.07933.pdf,Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation,"Inferring scene geometry from images via Structure from Motion is a +long-standing and fundamental problem in computer vision. While classical +approaches and, more recently, depth map predictions only focus on the visible +parts of a scene, the task of scene completion aims to reason about geometry +even in occluded regions. With the popularity of neural radiance fields +(NeRFs), implicit representations also became popular for scene completion by +predicting so-called density fields. Unlike explicit approaches. e.g. +voxel-based methods, density fields also allow for accurate depth prediction +and novel-view synthesis via image-based rendering. In this work, we propose to +fuse the scene reconstruction from multiple images and distill this knowledge +into a more accurate single-view scene reconstruction. To this end, we propose +Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed +images, trained fully self-supervised only from image data. Using knowledge +distillation, we use MVBTS to train a single-view scene completion network via +direct supervision called KDBTS. It achieves state-of-the-art performance on +occupancy prediction, especially in occluded regions.",cs.CV,['cs.CV'] +Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion,Sofia Casarin · Cynthia Ugwu · Sergio Escalera · Oswald Lanz, ,https://arxiv.org/abs/2403.15194,,2403.15194.pdf,Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion,"The landscape of deep learning research is moving towards innovative +strategies to harness the true potential of data. Traditionally, emphasis has +been on scaling model architectures, resulting in large and complex neural +networks, which can be difficult to train with limited computational resources. +However, independently of the model size, data quality (i.e. amount and +variability) is still a major factor that affects model generalization. In this +work, we propose a novel technique to exploit available data through the use of +automatic data augmentation for the tasks of image classification and semantic +segmentation. We introduce the first Differentiable Augmentation Search method +(DAS) to generate variations of images that can be processed as videos. +Compared to previous approaches, DAS is extremely fast and flexible, allowing +the search on very large search spaces in less than a GPU day. Our intuition is +that the increased receptive field in the temporal dimension provided by DAS +could lead to benefits also to the spatial receptive field. 
More specifically, +we leverage DAS to guide the reshaping of the spatial receptive field by +selecting task-dependant transformations. As a result, compared to standard +augmentation alternatives, we improve in terms of accuracy on ImageNet, +Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when +plugging-in our DAS over different light-weight video backbones.",cs.CV,"['cs.CV', 'cs.LG']" +A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution,Zhixiong Yang · Jingyuan Xia · Shengxi Li · Xinghua Huang · Shuanghui Zhang · Zhen Liu · Yaowen Fu · Yongxiang Liu, ,https://arxiv.org/abs/2404.15620,,2404.15620.pdf,A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution,"Deep learning-based methods have achieved significant successes on solving +the blind super-resolution (BSR) problem. However, most of them request +supervised pre-training on labelled datasets. This paper proposes an +unsupervised kernel estimation model, named dynamic kernel prior (DKP), to +realize an unsupervised and pre-training-free learning-based algorithm for +solving the BSR problem. DKP can adaptively learn dynamic kernel priors to +realize real-time kernel estimation, and thereby enables superior HR image +restoration performances. This is achieved by a Markov chain Monte Carlo +sampling process on random kernel distributions. The learned kernel prior is +then assigned to optimize a blur kernel estimation network, which entails a +network-based Langevin dynamic optimization strategy. These two techniques +ensure the accuracy of the kernel estimation. DKP can be easily used to replace +the kernel estimation models in the existing methods, such as Double-DIP and +FKP-DIP, or be added to the off-the-shelf image restoration model, such as +diffusion model. In this paper, we incorporate our DKP model with DIP and +diffusion model, referring to DIP-DKP and Diff-DKP, for validations. Extensive +simulations on Gaussian and motion kernel scenarios demonstrate that the +proposed DKP model can significantly improve the kernel estimation with +comparable runtime and memory usage, leading to state-of-the-art BSR results. +The code is available at https://github.com/XYLGroup/DKP.",eess.IV,['eess.IV'] +DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking,Fei Xie · Zhongdao Wang · Chao Ma, ,https://arxiv.org/abs/2308.09905,,2308.09905.pdf,DiffusionTrack: Diffusion Model For Multi-Object Tracking,"Multi-object tracking (MOT) is a challenging vision task that aims to detect +individual objects within a single frame and associate them across multiple +frames. Recent MOT approaches can be categorized into two-stage +tracking-by-detection (TBD) methods and one-stage joint detection and tracking +(JDT) methods. Despite the success of these approaches, they also suffer from +common problems, such as harmful global or local inconsistency, poor trade-off +between robustness and model complexity, and lack of flexibility in different +scenes within the same video. In this paper we propose a simple but robust +framework that formulates object detection and association jointly as a +consistent denoising diffusion process from paired noise boxes to paired +ground-truth boxes. This novel progressive denoising diffusion strategy +substantially augments the tracker's effectiveness, enabling it to discriminate +between various objects. 
During the training stage, paired object boxes diffuse +from paired ground-truth boxes to random distribution, and the model learns +detection and tracking simultaneously by reversing this noising process. In +inference, the model refines a set of paired randomly generated boxes to the +detection and tracking results in a flexible one-step or multi-step denoising +diffusion process. Extensive experiments on three widely used MOT benchmarks, +including MOT17, MOT20, and Dancetrack, demonstrate that our approach achieves +competitive performance compared to the current state-of-the-art methods.",cs.CV,['cs.CV'] +SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field,Lizhe Liu · Bohua Wang · Hongwei Xie · Daqi Liu · Li Liu · Kuiyuan Yang · Bing Wang · Zhiqiang Tian, ,https://arxiv.org/abs/2403.14366,,2403.14366.pdf,SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field,"Vision-centric 3D environment understanding is both vital and challenging for +autonomous driving systems. Recently, object-free methods have attracted +considerable attention. Such methods perceive the world by predicting the +semantics of discrete voxel grids but fail to construct continuous and accurate +obstacle surfaces. To this end, in this paper, we propose SurroundSDF to +implicitly predict the signed distance field (SDF) and semantic field for the +continuous perception from surround images. Specifically, we introduce a +query-based approach and utilize SDF constrained by the Eikonal formulation to +accurately describe the surfaces of obstacles. Furthermore, considering the +absence of precise SDF ground truth, we propose a novel weakly supervised +paradigm for SDF, referred to as the Sandwich Eikonal formulation, which +emphasizes applying correct and dense constraints on both sides of the surface, +thereby enhancing the perceptual accuracy of the surface. Experiments suggest +that our method achieves SOTA for both occupancy prediction and 3D scene +reconstruction tasks on the nuScenes dataset.",cs.CV,['cs.CV'] +DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior,Tianyu Huang · Yihan Zeng · Zhilu Zhang · Wan Xu · Hang Xu · Songcen Xu · Rynson W.H. Lau · Wangmeng Zuo,https://github.com/tyhuang0428/DreamControl,https://arxiv.org/abs/2312.06439,,2312.06439.pdf,DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior,"3D generation has raised great attention in recent years. With the success of +text-to-image diffusion models, the 2D-lifting technique becomes a promising +route to controllable 3D generation. However, these methods tend to present +inconsistent geometry, which is also known as the Janus problem. We observe +that the problem is caused mainly by two aspects, i.e., viewpoint bias in 2D +diffusion models and overfitting of the optimization objective. To address it, +we propose a two-stage 2D-lifting framework, namely DreamControl, which +optimizes coarse NeRF scenes as 3D self-prior and then generates fine-grained +objects with control-based score distillation. Specifically, adaptive viewpoint +sampling and boundary integrity metric are proposed to ensure the consistency +of generated priors. The priors are then regarded as input conditions to +maintain reasonable geometries, in which conditional LoRA and weighted score +are further proposed to optimize detailed textures. DreamControl can generate +high-quality 3D content in terms of both geometry consistency and texture +fidelity. 
Moreover, our control-based optimization guidance is applicable to +more downstream tasks, including user-guided generation and 3D animation. The +project page is available at https://github.com/tyhuang0428/DreamControl.",cs.CV,['cs.CV'] +Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID,Wentao Tan · Changxing Ding · Jiayu Jiang · Fei Wang · Yibing Zhan · Dapeng Tao, ,https://arxiv.org/abs/2405.04940,,,Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID,"Text-to-image person re-identification (ReID) retrieves pedestrian images +according to textual descriptions. Manually annotating textual descriptions is +time-consuming, restricting the scale of existing datasets and therefore the +generalization ability of ReID models. As a result, we study the transferable +text-to-image ReID problem, where we train a model on our proposed large-scale +database and directly deploy it to various datasets for evaluation. We obtain +substantial training data via Multi-modal Large Language Models (MLLMs). +Moreover, we identify and address two key challenges in utilizing the obtained +textual descriptions. First, an MLLM tends to generate descriptions with +similar structures, causing the model to overfit specific sentence patterns. +Thus, we propose a novel method that uses MLLMs to caption images according to +various templates. These templates are obtained using a multi-turn dialogue +with a Large Language Model (LLM). Therefore, we can build a large-scale +dataset with diverse textual descriptions. Second, an MLLM may produce +incorrect descriptions. Hence, we introduce a novel method that automatically +identifies words in a description that do not correspond with the image. This +method is based on the similarity between one text and all patch token +embeddings in the image. Then, we mask these words with a larger probability in +the subsequent training epoch, alleviating the impact of noisy textual +descriptions. The experimental results demonstrate that our methods +significantly boost the direct transfer text-to-image ReID performance. +Benefiting from the pre-trained model weights, we also achieve state-of-the-art +performance in the traditional evaluation settings.",cs.CV,['cs.CV'] +Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields,Leili Goli · Cody Reading · Silvia Sellán · Alec Jacobson · Andrea Tagliasacchi, ,https://arxiv.org/abs/2309.03185,,2309.03185.pdf,Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields,"Neural Radiance Fields (NeRFs) have shown promise in applications like view +synthesis and depth estimation, but learning from multiview images faces +inherent uncertainties. Current methods to quantify them are either heuristic +or computationally demanding. We introduce BayesRays, a post-hoc framework to +evaluate uncertainty in any pre-trained NeRF without modifying the training +process. Our method establishes a volumetric uncertainty field using spatial +perturbations and a Bayesian Laplace approximation. We derive our algorithm +statistically and show its superior performance in key metrics and +applications. 
Additional results available at: https://bayesrays.github.io.",cs.CV,['cs.CV'] +CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment,Sajid Javed · Arif Mahmood · IYYAKUTTI IYAPPAN GANAPATHI · Fayaz Ali · Naoufel Werghi · Mohammed Bennamoun, ,https://arxiv.org/abs/2306.07831,,2306.07831.pdf,Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images,"Contrastive visual language pretraining has emerged as a powerful method for +either training new language-aware image encoders or augmenting existing +pretrained models with zero-shot visual recognition capabilities. However, +existing works typically train on large datasets of image-text pairs and have +been designed to perform downstream tasks involving only small to medium +sized-images, neither of which are applicable to the emerging field of +computational pathology where there are limited publicly available paired +image-text datasets and each image can span up to 100,000 x 100,000 pixels. In +this paper we present MI-Zero, a simple and intuitive framework for unleashing +the zero-shot transfer capabilities of contrastively aligned image and text +models on gigapixel histopathology whole slide images, enabling multiple +downstream diagnostic tasks to be carried out by pretrained encoders without +requiring any additional labels. MI-Zero reformulates zero-shot transfer under +the framework of multiple instance learning to overcome the computational +challenge of inference on extremely large images. We used over 550k pathology +reports and other available in-domain text corpora to pre-train our text +encoder. By effectively leveraging strong pre-trained encoders, our best model +pretrained on over 33k histopathology image-caption pairs achieves an average +median zero-shot accuracy of 70.2% across three different real-world cancer +subtyping tasks. Our code is available at: +https://github.com/mahmoodlab/MI-Zero.",cs.CV,['cs.CV'] +UniMODE: Unified Monocular 3D Object Detection,Zhuoling Li · Xiaogang Xu · Ser-Nam Lim · Hengshuang Zhao, ,https://arxiv.org/abs/2402.18573,,2402.18573.pdf,UniMODE: Unified Monocular 3D Object Detection,"Realizing unified monocular 3D object detection, including both indoor and +outdoor scenes, holds great importance in applications like robot navigation. +However, involving various scenarios of data to train models poses challenges +due to their significantly different characteristics, e.g., diverse geometry +properties and heterogeneous domain distributions. To address these challenges, +we build a detector based on the bird's-eye-view (BEV) detection paradigm, +where the explicit feature projection is beneficial to addressing the geometry +learning ambiguity when employing multiple scenarios of data to train +detectors. Then, we split the classical BEV detection architecture into two +stages and propose an uneven BEV grid design to handle the convergence +instability caused by the aforementioned challenges. Moreover, we develop a +sparse BEV feature projection strategy to reduce computational cost and a +unified domain alignment method to handle heterogeneous domains. 
Combining +these techniques, a unified detector UniMODE is derived, which surpasses the +previous state-of-the-art on the challenging Omni3D dataset (a large-scale +dataset including both indoor and outdoor scenes) by 4.9% AP_3D, revealing the +first successful generalization of a BEV detector to unified 3D object +detection.",cs.CV,['cs.CV'] +Perceptual Assessment and Optimization of HDR Image Rendering,Peibei Cao · Rafal Mantiuk · Kede Ma, ,https://arxiv.org/abs/2310.12877v4,,2310.12877v4.pdf,Perceptual Assessment and Optimization of High Dynamic Range Image Rendering,"High dynamic range (HDR) rendering has the ability to faithfully reproduce +the wide luminance ranges in natural scenes, but how to accurately assess the +rendering quality is relatively underexplored. Existing quality models are +mostly designed for low dynamic range (LDR) images, and do not align well with +human perception of HDR image quality. To fill this gap, we propose a family of +HDR quality metrics, in which the key step is employing a simple inverse +display model to decompose an HDR image into a stack of LDR images with varying +exposures. Subsequently, these decomposed images are assessed through +well-established LDR quality metrics. Our HDR quality models present three +distinct benefits. First, they directly inherit the recent advancements of LDR +quality metrics. Second, they do not rely on human perceptual data of HDR image +quality for re-calibration. Third, they facilitate the alignment and +prioritization of specific luminance ranges for more accurate and detailed +quality assessment. Experimental results show that our HDR quality metrics +consistently outperform existing models in terms of quality assessment on four +HDR image quality datasets and perceptual optimization of HDR novel view +synthesis.",eess.IV,"['eess.IV', 'cs.CV']" +From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior,Jaeho Moon · Juan Luis Gonzalez Bello · Byeongjun Kwon · Munchurl Kim,https://kaist-viclab.github.io/From_Ground_To_Objects_site/,https://arxiv.org/abs/2312.10118,,2312.10118.pdf,From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior,"Self-supervised monocular depth estimation (DE) is an approach to learning +depth without costly depth ground truths. However, it often struggles with +moving objects that violate the static scene assumption during training. To +address this issue, we introduce a coarse-to-fine training strategy leveraging +the ground contacting prior based on the observation that most moving objects +in outdoor scenes contact the ground. In the coarse training stage, we exclude +the objects in dynamic classes from the reprojection loss calculation to avoid +inaccurate depth learning. To provide precise supervision on the depth of the +objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss +(GDS-Loss) that encourages a DE network to align the depth of the objects with +their ground-contacting points. Subsequently, in the fine training stage, we +refine the DE network to learn the detailed depth of the objects from the +reprojection loss, while ensuring accurate DE on the moving object regions by +employing our regularization loss with a cost-volume-based weighting factor. 
+Our overall coarse-to-fine training strategy can easily be integrated with +existing DE methods without any modifications, significantly enhancing DE +performance on challenging Cityscapes and KITTI datasets, especially in the +moving object regions.",cs.CV,['cs.CV'] +DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models,Khawar Islam · Muhammad Zaigham Zaheer · Arif Mahmood · Karthik Nandakumar,https://diffusemix.github.io/,https://arxiv.org/abs/2405.14881,,2405.14881.pdf,DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models,"Recently, a number of image-mixing-based augmentation techniques have been +introduced to improve the generalization of deep neural networks. In these +techniques, two or more randomly selected natural images are mixed together to +generate an augmented image. Such methods may not only omit important portions +of the input images but also introduce label ambiguities by mixing images +across labels resulting in misleading supervisory signals. To address these +limitations, we propose DiffuseMix, a novel data augmentation technique that +leverages a diffusion model to reshape training images, supervised by our +bespoke conditional prompts. First, concatenation of a partial natural image +and its generated counterpart is obtained which helps in avoiding the +generation of unrealistic images or label ambiguities. Then, to enhance +resilience against adversarial attacks and improves safety measures, a randomly +selected structural pattern from a set of fractal images is blended into the +concatenated image to form the final augmented image for training. Our +empirical results on seven different datasets reveal that DiffuseMix achieves +superior performance compared to existing state-of the-art methods on tasks +including general classification,fine-grained classification, fine-tuning, data +scarcity, and adversarial robustness. Augmented datasets and codes are +available here: https://diffusemix.github.io/",cs.CV,['cs.CV'] +Neural Exposure Fusion for High-Dynamic Range Object Detection,Emmanuel Onzon · Maximilian Bömer · Fahim Mannan · Felix Heide, ,https://arxiv.org/abs/2405.16038,,2405.16038.pdf,Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection,"Most recent multispectral object detectors employ a two-branch structure to +extract features from RGB and thermal images. While the two-branch structure +achieves better performance than a single-branch structure, it overlooks +inference efficiency. This conflict is increasingly aggressive, as recent works +solely pursue higher performance rather than both performance and efficiency. +In this paper, we address this issue by improving the performance of efficient +single-branch structures. We revisit the reasons causing the performance gap +between these structures. For the first time, we reveal the information +interference problem in the naive early-fusion strategy adopted by previous +single-branch structures. Besides, we find that the domain gap between +multispectral images, and weak feature representation of the single-branch +structure are also key obstacles for performance. Focusing on these three +problems, we propose corresponding solutions, including a novel shape-priority +early-fusion strategy, a weakly supervised learning method, and a core +knowledge distillation technique. Experiments demonstrate that single-branch +networks equipped with these three contributions achieve significant +performance enhancements while retaining high efficiency. 
Our code will be +available at +\url{https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection}.",cs.CV,['cs.CV'] +Cross-view and Cross-pose Completion for 3D Human Understanding,Matthieu Armando · Salma Galaaoui · Fabien Baradel · Thomas Lucas · Vincent Leroy · Romain BRÉGIER · Philippe Weinzaepfel · Grégory Rogez, ,https://arxiv.org/abs/2311.09104,,2311.09104.pdf,Cross-view and Cross-pose Completion for 3D Human Understanding,"Human perception and understanding is a major domain of computer vision +which, like many other vision subdomains recently, stands to gain from the use +of large models pre-trained on large datasets. We hypothesize that the most +common pre-training strategy of relying on general purpose, object-centric +image datasets such as ImageNet, is limited by an important domain shift. On +the other hand, collecting domain-specific ground truth such as 2D or 3D labels +does not scale well. Therefore, we propose a pre-training approach based on +self-supervised learning that works on human-centric data using only images. +Our method uses pairs of images of humans: the first is partially masked and +the model is trained to reconstruct the masked parts given the visible ones and +a second image. It relies on both stereoscopic (cross-view) pairs, and temporal +(cross-pose) pairs taken from videos, in order to learn priors about 3D as well +as human motion. We pre-train a model for body-centric tasks and one for +hand-centric tasks. With a generic transformer architecture, these models +outperform existing self-supervised pre-training methods on a wide set of +human-centric downstream tasks, and obtain state-of-the-art performance for +instance when fine-tuning for model-based and model-free human mesh recovery.",cs.CV,['cs.CV'] +Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation,Zhiwu Qing · Shiwei Zhang · Jiayu Wang · Xiang Wang · Yujie Wei · Yingya Zhang · Changxin Gao · Nong Sang, ,https://arxiv.org/abs/2312.04483,,2312.04483.pdf,Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation,"Despite diffusion models having shown powerful abilities to generate +photorealistic images, generating videos that are realistic and diverse still +remains in its infancy. One of the key reasons is that current methods +intertwine spatial content and temporal dynamics together, leading to a notably +increased complexity of text-to-video generation (T2V). In this work, we +propose HiGen, a diffusion model-based method that improves performance by +decoupling the spatial and temporal factors of videos from two perspectives, +i.e., structure level and content level. At the structure level, we decompose +the T2V task into two steps, including spatial reasoning and temporal +reasoning, using a unified denoiser. Specifically, we generate spatially +coherent priors using text during spatial reasoning and then generate +temporally coherent motions from these priors during temporal reasoning. At the +content level, we extract two subtle cues from the content of the input video +that can express motion and appearance changes, respectively. These two cues +then guide the model's training for generating videos, enabling flexible +content variations and enhancing temporal stability. Through the decoupled +paradigm, HiGen can effectively reduce the complexity of this task and generate +realistic videos with semantics accuracy and motion stability. 
Extensive +experiments demonstrate the superior performance of HiGen over the +state-of-the-art T2V methods.",cs.CV,['cs.CV'] +SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks,Xinyu Shi · Zecheng Hao · Zhaofei Yu, ,https://arxiv.org/abs/2403.14302,,2403.14302.pdf,SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks,"The remarkable success of Vision Transformers in Artificial Neural Networks +(ANNs) has led to a growing interest in incorporating the self-attention +mechanism and transformer-based architecture into Spiking Neural Networks +(SNNs). While existing methods propose spiking self-attention mechanisms that +are compatible with SNNs, they lack reasonable scaling methods, and the overall +architectures proposed by these methods suffer from a bottleneck in effectively +extracting local features. To address these challenges, we propose a novel +spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a +reasonable scaling method. Based on DSSA, we propose a novel spiking Vision +Transformer architecture called SpikingResformer, which combines the +ResNet-based multi-stage architecture with our proposed DSSA to improve both +performance and energy efficiency while reducing parameters. Experimental +results show that SpikingResformer achieves higher accuracy with fewer +parameters and lower energy consumption than other spiking Vision Transformer +counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on +ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN +field.",cs.NE,"['cs.NE', 'cs.CV', 'cs.LG']" +C$^2$KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation,Fushuo Huo · Wenchao Xu · Jingcai Guo · Haozhao Wang · Song Guo, ,https://arxiv.org/abs/2312.17648,,2312.17648.pdf,Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation,"Visual grounding aims to align visual information of specific regions of +images with corresponding natural language expressions. Current visual +grounding methods leverage pre-trained visual and language backbones separately +to obtain visual features and linguistic features. Although these two types of +features are then fused via delicately designed networks, the heterogeneity of +the features makes them inapplicable for multi-modal reasoning. This problem +arises from the domain gap between the single-modal pre-training backbone used +in current visual grounding methods, which can hardly be overcome by the +traditional end-to-end training method. To alleviate this, our work proposes an +Empowering pre-trained model for Visual Grounding (EpmVG) framework, which +distills a multimodal pre-trained model to guide the visual grounding task. +EpmVG is based on a novel cross-modal distillation mechanism, which can +effectively introduce the consistency information of images and texts in the +pre-trained model, to reduce the domain gap existing in the backbone networks, +thereby improving the performance of the model in the visual grounding task. 
+Extensive experiments are carried out on five conventionally used datasets, and +results demonstrate that our method achieves better performance than +state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.AI']" +Map-Relative Pose Regression for Visual Re-Localization,Shuai Chen · Tommaso Cavallari · Victor Adrian Prisacariu · Eric Brachmann, ,https://arxiv.org/abs/2404.09884,,2404.09884.pdf,Map-Relative Pose Regression for Visual Re-Localization,"Pose regression networks predict the camera pose of a query image relative to +a known environment. Within this family of methods, absolute pose regression +(APR) has recently shown promising accuracy in the range of a few centimeters +in position error. APR networks encode the scene geometry implicitly in their +weights. To achieve high accuracy, they require vast amounts of training data +that, realistically, can only be created using novel view synthesis in a +days-long process. This process has to be repeated for each new scene again and +again. We present a new approach to pose regression, map-relative pose +regression (marepo), that satisfies the data hunger of the pose regression +network in a scene-agnostic fashion. We condition the pose regressor on a +scene-specific map representation such that its pose predictions are relative +to the scene map. This allows us to train the pose regressor across hundreds of +scenes to learn the generic relation between a scene-specific map +representation and the camera pose. Our map-relative pose regressor can be +applied to new map representations immediately or after mere minutes of +fine-tuning for the highest accuracy. Our approach outperforms previous pose +regression methods by far on two public datasets, indoor and outdoor. Code is +available: https://nianticlabs.github.io/marepo",cs.CV,"['cs.CV', 'cs.LG']" +Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation,Zihan Wang · Xiangyang Li · Jiahao Yang · Yeqi Liu · Junjie Hu · Ming Jiang · Shuqiang Jiang,https://github.com/MrZihan/HNR-VLN,https://arxiv.org/abs/2404.01943,,2404.01943.pdf,Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation,"Vision-and-language navigation (VLN) enables the agent to navigate to a +remote location following the natural language instruction in 3D environments. +At each navigation step, the agent selects from possible candidate locations +and then makes the move. For better navigation planning, the lookahead +exploration strategy aims to effectively evaluate the agent's next action by +accurately anticipating the future environment of candidate locations. To this +end, some existing works predict RGB images for future environments, while this +strategy suffers from image distortion and high computational cost. To address +these issues, we propose the pre-trained hierarchical neural radiance +representation model (HNR) to produce multi-level semantic features for future +environments, which are more robust and efficient than pixel-wise RGB +reconstruction. Furthermore, with the predicted future environmental +representations, our lookahead VLN model is able to construct the navigable +future path tree and select the optimal path via efficient parallel evaluation. 
+Extensive experiments on the VLN-CE datasets confirm the effectiveness of our +method.",cs.CV,"['cs.CV', 'cs.RO']" +CAT-Seg: Cost Aggregation for Open-vocabulary Semantic Segmentation,Seokju Cho · Heeseong Shin · Sunghwan Hong · Anurag Arnab · Paul Hongsuck Seo · Seungryong Kim, ,,https://openreview.net/forum?id=ZWytHTcnTy,,,,,nan +Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments,Liyuan Zhu · Shengyu Huang · Konrad Schindler · Iro Armeni,https://www.zhuliyuan.net/livingscenes,https://arxiv.org/abs/2312.09138,,2312.09138.pdf,Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments,"Research into dynamic 3D scene understanding has primarily focused on +short-term change tracking from dense observations, while little attention has +been paid to long-term changes with sparse observations. We address this gap +with MoRE, a novel approach for multi-object relocalization and reconstruction +in evolving environments. We view these environments as ""living scenes"" and +consider the problem of transforming scans taken at different points in time +into a 3D reconstruction of the object instances, whose accuracy and +completeness increase over time. At the core of our method lies an +SE(3)-equivariant representation in a single encoder-decoder network, trained +on synthetic data. This representation enables us to seamlessly tackle instance +matching, registration, and reconstruction. We also introduce a joint +optimization algorithm that facilitates the accumulation of point clouds +originating from the same instance across multiple scans taken at different +points in time. We validate our method on synthetic and real-world data and +demonstrate state-of-the-art performance in both end-to-end performance and +individual subtasks.",cs.CV,['cs.CV'] +CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation,Kangfu Mei · Mauricio Delbracio · Hossein Talebi · Zhengzhong Tu · Vishal M. Patel · Peyman Milanfar,https://fast-codi.github.io/,https://arxiv.org/abs/2310.01407,,2310.01407.pdf,CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation,"Large generative diffusion models have revolutionized text-to-image +generation and offer immense potential for conditional generation tasks such as +image enhancement, restoration, editing, and compositing. However, their +widespread adoption is hindered by the high computational cost, which limits +their real-time application. To address this challenge, we introduce a novel +method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept +additional image conditioning inputs while significantly reducing the sampling +steps required to achieve high-quality results. Our method can leverage +architectures such as ControlNet to incorporate conditioning inputs without +compromising the model's prior knowledge gained during large scale +pre-training. Additionally, a conditional consistency loss enforces consistent +predictions across diffusion steps, effectively compelling the model to +generate high-quality images with conditions in a few steps. 
Our +conditional-task learning and distillation approach outperforms previous +distillation methods, achieving a new state-of-the-art in producing +high-quality images with very few steps (e.g., 1-4) across multiple tasks, +including super-resolution, text-guided image editing, and depth-to-image +generation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows,Zhenggang Tang · Jason Ren · Xiaoming Zhao · Bowen Wen · Jonathan Tremblay · Stan Birchfield · Alexander G. Schwing, ,https://arxiv.org/abs/2405.05010,,2405.05010.pdf,${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields,"Neural fields (NeRF) have emerged as a promising approach for representing +continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs +poses a significant challenge for scene decomposition. To address this +challenge, we present a single model, Multi-Modal Decomposition NeRF +(${M^2D}$NeRF), that is capable of both text-based and visual patch-based +edits. Specifically, we use multi-modal feature distillation to integrate +teacher features from pretrained visual and language models into 3D semantic +feature volumes, thereby facilitating consistent 3D editing. To enforce +consistency between the visual and language features in our 3D feature volumes, +we introduce a multi-modal similarity constraint. We also introduce a +patch-based joint contrastive loss that helps to encourage object-regions to +coalesce in the 3D feature space, resulting in more precise boundaries. +Experiments on various real-world scenes show superior performance in 3D scene +decomposition tasks compared to prior NeRF-based methods.",cs.CV,['cs.CV'] +Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements,Niccolò Biondi · Federico Pernici · Simone Ricci · Alberto Del Bimbo,https://github.com/miccunifi/iamcl2r,https://arxiv.org/abs/2405.02581,,2405.02581.pdf,Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements,"Learning compatible representations enables the interchangeable use of +semantic features as models are updated over time. This is particularly +relevant in search and retrieval systems where it is crucial to avoid +reprocessing of the gallery images with the updated model. While recent +research has shown promising empirical evidence, there is still a lack of +comprehensive theoretical understanding about learning compatible +representations. In this paper, we demonstrate that the stationary +representations learned by the $d$-Simplex fixed classifier optimally +approximate compatibility representation according to the two inequality +constraints of its formal definition. This not only establishes a solid +foundation for future works in this line of research but also presents +implications that can be exploited in practical learning scenarios. An +exemplary application is the now-standard practice of downloading and +fine-tuning new pre-trained models. Specifically, we show the strengths and +critical issues of stationary representations in the case in which a model +undergoing sequential fine-tuning is asynchronously replaced by downloading a +better-performing model pre-trained elsewhere. Such a representation enables +seamless delivery of retrieval service (i.e., no reprocessing of gallery +images) and offers improved performance without operational disruptions during +model replacement. 
Code available at: https://github.com/miccunifi/iamcl2r.",cs.CV,['cs.CV'] +Boosting Flow-based Generative Super-Resolution Models via Learned Prior,Li-Yuan Tsao · Yi-Chen Lo · Chia-Che Chang · Hao-Wei Chen · Roy Tseng · Chien Feng · Chun-Yi Lee,https://github.com/liyuantsao/FlowSR-LP,https://arxiv.org/abs/2403.10988,,2403.10988.pdf,Boosting Flow-based Generative Super-Resolution Models via Learned Prior,"Flow-based super-resolution (SR) models have demonstrated astonishing +capabilities in generating high-quality images. However, these methods +encounter several challenges during image generation, such as grid artifacts, +exploding inverses, and suboptimal results due to a fixed sampling temperature. +To overcome these issues, this work introduces a conditional learned prior to +the inference phase of a flow-based SR model. This prior is a latent code +predicted by our proposed latent module conditioned on the low-resolution +image, which is then transformed by the flow model into an SR image. Our +framework is designed to seamlessly integrate with any contemporary flow-based +SR model without modifying its architecture or pre-trained weights. We evaluate +the effectiveness of our proposed framework through extensive experiments and +ablation analyses. The proposed framework successfully addresses all the +inherent issues in flow-based SR models and enhances their performance in +various SR scenarios. Our code is available at: +https://github.com/liyuantsao/BFSR",cs.CV,"['cs.CV', 'cs.AI']" +Video Frame Interpolation via Direct Synthesis with the Event-based Reference,Yuhan Liu · Yongjian Deng · Hao Chen · Zhen Yang, ,https://arxiv.org/abs/2404.18156,,2404.18156.pdf,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. 
Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +Universal Robustness via Median Random Smoothing for Real-World Super-Resolution,Zakariya Chaouai · Mohamed Tamaazousti, ,https://arxiv.org/abs/2405.14934,,2405.14934.pdf,Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution,"Most of the recent literature on image Super-Resolution (SR) can be +classified into two main approaches. The first one involves learning a +corruption model tailored to a specific dataset, aiming to mimic the noise and +corruption in low-resolution images, such as sensor noise. However, this +approach is data-specific, tends to lack adaptability, and its accuracy +diminishes when faced with unseen types of image corruptions. A second and more +recent approach, referred to as Robust Super-Resolution (RSR), proposes to +improve real-world SR by harnessing the generalization capabilities of a model +by making it robust to adversarial attacks. To delve further into this second +approach, our paper explores the universality of various methods for enhancing +the robustness of deep learning SR models. In other words, we inquire: ""Which +robustness method exhibits the highest degree of adaptability when dealing with +a wide range of adversarial attacks ?"". Our extensive experimentation on both +synthetic and real-world images empirically demonstrates that median randomized +smoothing (MRS) is more general in terms of robustness compared to adversarial +learning techniques, which tend to focus on specific types of attacks. +Furthermore, as expected, we also illustrate that the proposed universal robust +method enables the SR model to handle standard corruptions more effectively, +such as blur and Gaussian noise, and notably, corruptions naturally present in +real-world images. These results support the significance of shifting the +paradigm in the development of real-world SR methods towards RSR, especially +via MRS.",eess.IV,"['eess.IV', 'cs.CV']" +AAMDM: Accelerated Auto-regressive Motion Diffusion Model,Tianyu Li · Calvin Zhuhan Qiao · Ren Guanqiao · KangKang Yin · Sehoon Ha, ,https://arxiv.org/abs/2401.06146,,2401.06146.pdf,AAMDM: Accelerated Auto-regressive Motion Diffusion Model,"Interactive motion synthesis is essential in creating immersive experiences +in entertainment applications, such as video games and virtual reality. +However, generating animations that are both high-quality and contextually +responsive remains a challenge. Traditional techniques in the game industry can +produce high-fidelity animations but suffer from high computational costs and +poor scalability. Trained neural network models alleviate the memory and speed +issues, yet fall short on generating diverse motions. Diffusion models offer +diverse motion synthesis with low memory usage, but require expensive reverse +diffusion processes. This paper introduces the Accelerated Auto-regressive +Motion Diffusion Model (AAMDM), a novel motion synthesis framework designed to +achieve quality, diversity, and efficiency all together. AAMDM integrates +Denoising Diffusion GANs as a fast Generation Module, and an Auto-regressive +Diffusion Model as a Polishing Module. Furthermore, AAMDM operates in a +lower-dimensional embedded space rather than the full-dimensional pose space, +which reduces the training complexity as well as further improves the +performance. 
We show that AAMDM outperforms existing methods in motion quality, +diversity, and runtime efficiency, through comprehensive quantitative analyses +and visual comparisons. We also demonstrate the effectiveness of each +algorithmic component through ablation studies.",cs.CV,"['cs.CV', 'cs.GR']" +SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,Hsuan-I Ho · Jie Song · Otmar Hilliges,https://ait.ethz.ch/sith,https://arxiv.org/abs/2311.15855,,2311.15855.pdf,SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,"A long-standing goal of 3D human reconstruction is to create lifelike and +fully detailed 3D humans from single-view images. The main challenge lies in +inferring unknown body shapes, appearances, and clothing details in areas not +visible in the images. To address this, we propose SiTH, a novel pipeline that +uniquely integrates an image-conditioned diffusion model into a 3D mesh +reconstruction workflow. At the core of our method lies the decomposition of +the challenging single-view reconstruction problem into generative +hallucination and reconstruction subproblems. For the former, we employ a +powerful generative diffusion model to hallucinate unseen back-view appearance +based on the input images. For the latter, we leverage skinned body meshes as +guidance to recover full-body texture meshes from the input and back-view +images. SiTH requires as few as 500 3D human scans for training while +maintaining its generality and robustness to diverse images. Extensive +evaluations on two 3D human benchmarks, including our newly created one, +highlighted our method's superior accuracy and perceptual quality in 3D +textured human reconstruction. Our code and evaluation benchmark are available +at https://ait.ethz.ch/sith",cs.CV,['cs.CV'] +HUGS: Human Gaussian Splatting,Muhammed Kocabas · Jen-Hao Rick Chang · James Gabriel · Oncel Tuzel · Anurag Ranjan,https://machinelearning.apple.com/research/hugs,https://arxiv.org/abs/2311.17910v1,,2311.17910v1.pdf,HUGS: Human Gaussian Splats,"Recent advances in neural rendering have improved both training and rendering +times by orders of magnitude. While these methods demonstrate state-of-the-art +quality and speed, they are designed for photogrammetry of static scenes and do +not generalize well to freely moving humans in the environment. In this work, +we introduce Human Gaussian Splats (HUGS) that represents an animatable human +together with the scene using 3D Gaussian Splatting (3DGS). Our method takes +only a monocular video with a small number of (50-100) frames, and it +automatically learns to disentangle the static scene and a fully animatable +human avatar within 30 minutes. We utilize the SMPL body model to initialize +the human Gaussians. To capture details that are not modeled by SMPL (e.g. +cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. +Utilizing 3D Gaussians for animated humans brings new challenges, including the +artifacts created when articulating the Gaussians. We propose to jointly +optimize the linear blend skinning weights to coordinate the movements of +individual Gaussians during animation. Our approach enables novel-pose +synthesis of human and novel view synthesis of both the human and the scene. We +achieve state-of-the-art rendering quality with a rendering speed of 60 FPS +while being ~100x faster to train over previous work. 
Our code will be +announced here: https://github.com/apple/ml-hugs",cs.CV,"['cs.CV', 'cs.GR']" +SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency,Paul Roetzer · Florian Bernard, ,https://arxiv.org/abs/2310.08230,,2310.08230.pdf,Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching,"In this work we propose to combine the advantages of learning-based and +combinatorial formalisms for 3D shape matching. While learning-based shape +matching solutions lead to state-of-the-art matching performance, they do not +ensure geometric consistency, so that obtained matchings are locally unsmooth. +On the contrary, axiomatic methods allow to take geometric consistency into +account by explicitly constraining the space of valid matchings. However, +existing axiomatic formalisms are impractical since they do not scale to +practically relevant problem sizes, or they require user input for the +initialisation of non-convex optimisation problems. In this work we aim to +close this gap by proposing a novel combinatorial solver that combines a unique +set of favourable properties: our approach is (i) initialisation free, (ii) +massively parallelisable powered by a quasi-Newton method, (iii) provides +optimality gaps, and (iv) delivers decreased runtime and globally optimal +results for many instances.",cs.CV,['cs.CV'] +Building Optimal Neural Architectures using Interpretable Knowledge,Keith Mills · Fred Han · Mohammad Salameh · Shengyao Lu · CHUNHUA ZHOU · Jiao He · Fengyu Sun · Di Niu,https://github.com/Ascend-Research/AutoBuild,https://arxiv.org/abs/2403.13293,,2403.13293.pdf,Building Optimal Neural Architectures using Interpretable Knowledge,"Neural Architecture Search is a costly practice. The fact that a search space +can span a vast number of design choices with each architecture evaluation +taking nontrivial overhead makes it hard for an algorithm to sufficiently +explore candidate networks. In this paper, we propose AutoBuild, a scheme which +learns to align the latent embeddings of operations and architecture modules +with the ground-truth performance of the architectures they appear in. By doing +so, AutoBuild is capable of assigning interpretable importance scores to +architecture modules, such as individual operation features and larger macro +operation sequences such that high-performance neural networks can be +constructed without any need for search. Through experiments performed on +state-of-the-art image classification, segmentation, and Stable Diffusion +models, we show that by mining a relatively small set of evaluated +architectures, AutoBuild can learn to build high-quality architectures directly +or help to reduce search space to focus on relevant areas, finding better +architectures that outperform both the original labeled ones and ones found by +search baselines. Code available at +https://github.com/Ascend-Research/AutoBuild",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model,Kai Yang · Jian Tao · Jiafei Lyu · Chunjiang Ge · Jiaxin Chen · Weihan Shen · Xiaolong Zhu · Xiu Li,https://github.com/yk7333/d3po/,https://arxiv.org/abs/2311.13231,,2311.13231.pdf,Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model,"Using reinforcement learning with human feedback (RLHF) has shown significant +promise in fine-tuning diffusion models. 
Previous methods start by training a +reward model that aligns with human preferences, then leverage RL techniques to +fine-tune the underlying models. However, crafting an efficient reward model +demands extensive datasets, optimal architecture, and manual hyperparameter +tuning, making the process both time and cost-intensive. The direct preference +optimization (DPO) method, effective in fine-tuning large language models, +eliminates the necessity for a reward model. However, the extensive GPU memory +requirement of the diffusion model's denoising process hinders the direct +application of the DPO method. To address this issue, we introduce the Direct +Preference for Denoising Diffusion Policy Optimization (D3PO) method to +directly fine-tune diffusion models. The theoretical analysis demonstrates that +although D3PO omits training a reward model, it effectively functions as the +optimal reward model trained using human feedback data to guide the learning +process. This approach requires no training of a reward model, proving to be +more direct, cost-effective, and minimizing computational overhead. In +experiments, our method uses the relative scale of objectives as a proxy for +human preference, delivering comparable results to methods using ground-truth +rewards. Moreover, D3PO demonstrates the ability to reduce image distortion +rates and generate safer images, overcoming challenges lacking robust reward +models. Our code is publicly available at https://github.com/yk7333/D3PO.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,Anna Kukleva · Fadime Sener · Edoardo Remelli · Bugra Tekin · Eric Sauser · Bernt Schiele · Shugao Ma, ,https://arxiv.org/abs/2403.19811,,2403.19811.pdf,X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,"Lately, there has been growing interest in adapting vision-language models +(VLMs) to image and third-person video classification due to their success in +zero-shot recognition. However, the adaptation of these models to egocentric +videos has been largely unexplored. To address this gap, we propose a simple +yet effective cross-modal adaptation framework, which we call X-MIC. Using a +video adapter, our pipeline learns to align frozen text embeddings to each +egocentric video directly in the shared embedding space. Our novel adapter +architecture retains and improves generalization of the pre-trained VLMs by +disentangling learnable temporal modeling and frozen visual encoder. This +results in an enhanced alignment of text embeddings to each egocentric video, +leading to a significant improvement in cross-dataset generalization. We +evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for +fine-grained cross-dataset action generalization, demonstrating the +effectiveness of our method. Code is available at +https://github.com/annusha/xmic",cs.CV,['cs.CV'] +MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation,Zhicheng Zhang · Pancheng Zhao · Eunil Park · Jufeng Yang, ,https://arxiv.org/abs/2306.15876,,2306.15876.pdf,Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners,"Representation learning has been evolving from traditional supervised +training to Contrastive Learning (CL) and Masked Image Modeling (MIM). 
Previous +works have demonstrated their pros and cons in specific scenarios, i.e., CL and +supervised pre-training excel at capturing longer-range global patterns and +enabling better feature discrimination, while MIM can introduce more local and +diverse attention across all transformer layers. In this paper, we explore how +to obtain a model that combines their strengths. We start by examining previous +feature distillation and mask feature reconstruction methods and identify their +limitations. We find that their increasing diversity mainly derives from the +asymmetric designs, but these designs may in turn compromise the discrimination +ability. In order to better obtain both discrimination and diversity, we +propose a simple but effective Hybrid Distillation strategy, which utilizes +both the supervised/CL teacher and the MIM teacher to jointly guide the student +model. Hybrid Distill imitates the token relations of the MIM teacher to +alleviate attention collapse, as well as distills the feature maps of the +supervised/CL teacher to enable discrimination. Furthermore, a progressive +redundant token masking strategy is also utilized to reduce the distilling +costs and avoid falling into local optima. Experiment results prove that Hybrid +Distill can achieve superior performance on different benchmarks.",cs.CV,['cs.CV'] +Monocular Identity-Conditioned Facial Reflectance Reconstruction,Xingyu Ren · Jiankang Deng · Yuhao Cheng · Jia Guo · Chao Ma · Yichao Yan · Wenhan Zhu · Xiaokang Yang,https://xingyuren.github.io/id2reflectance/,https://arxiv.org/abs/2404.00301,,2404.00301.pdf,Monocular Identity-Conditioned Facial Reflectance Reconstruction,"Recent 3D face reconstruction methods have made remarkable advancements, yet +there remain huge challenges in monocular high-quality facial reflectance +reconstruction. Existing methods rely on a large amount of light-stage captured +data to learn facial reflectance models. However, the lack of subject diversity +poses challenges in achieving good generalization and widespread applicability. +In this paper, we learn the reflectance prior in image space rather than UV +space and present a framework named ID2Reflectance. Our framework can directly +estimate the reflectance maps of a single image while using limited reflectance +data for training. Our key insight is that reflectance data shares facial +structures with RGB faces, which enables obtaining expressive facial prior from +inexpensive RGB data thus reducing the dependency on reflectance data. We first +learn a high-quality prior for facial reflectance. Specifically, we pretrain +multi-domain facial feature codebooks and design a codebook fusion method to +align the reflectance and RGB domains. Then, we propose an identity-conditioned +swapping module that injects facial identity from the target image into the +pre-trained autoencoder to modify the identity of the source reflectance image. +Finally, we stitch multi-view swapped reflectance images to obtain renderable +assets. Extensive experiments demonstrate that our method exhibits excellent +generalization capability and achieves state-of-the-art facial reflectance +reconstruction results for in-the-wild faces. 
Our project page is +https://xingyuren.github.io/id2reflectance/.",cs.CV,['cs.CV'] +CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing,Ajian Liu · Shuai Xue · Gan Jianwen · Jun Wan · Yanyan Liang · Jiankang Deng · Sergio Escalera · Zhen Lei, ,https://arxiv.org/abs/2403.14333,,2403.14333.pdf,CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing,"Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the +model's performance on unseen domains. Existing methods either rely on domain +labels to align domain-invariant feature spaces, or disentangle generalizable +features from the whole sample, which inevitably lead to the distortion of +semantic feature structures and achieve limited generalization. In this work, +we make use of large-scale VLMs like CLIP and leverage the textual feature to +dynamically adjust the classifier's weights for exploring generalizable visual +features. Specifically, we propose a novel Class Free Prompt Learning (CFPL) +paradigm for DG FAS, which utilizes two lightweight transformers, namely +Content Q-Former (CQF) and Style Q-Former (SQF), to learn the different +semantic prompts conditioned on content and style features by using a set of +learnable query vectors, respectively. Thus, the generalizable prompt can be +learned by two improvements: (1) A Prompt-Text Matched (PTM) supervision is +introduced to ensure CQF learns visual representation that is most informative +of the content description. (2) A Diversified Style Prompt (DSP) technology is +proposed to diversify the learning of style prompts by mixing feature +statistics between instance-specific styles. Finally, the learned text features +modulate visual features to generalization through the designed Prompt +Modulation (PM). Extensive experiments show that the CFPL is effective and +outperforms the state-of-the-art methods on several cross-domain datasets.",cs.CV,['cs.CV'] +BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning,Hongwei Zheng · Linyuan Zhou · Han Li · Jinming Su · Xiaoming Wei · Xu Xiaoming, ,https://arxiv.org/abs/2404.01179,,2404.01179.pdf,BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning,"Data mixing methods play a crucial role in semi-supervised learning (SSL), +but their application is unexplored in long-tailed semi-supervised learning +(LTSSL). The primary reason is that the in-batch mixing manner fails to address +class imbalance. Furthermore, existing LTSSL methods mainly focus on +re-balancing data quantity but ignore class-wise uncertainty, which is also +vital for class balance. For instance, some classes with sufficient samples +might still exhibit high uncertainty due to indistinguishable features. To this +end, this paper introduces the Balanced and Entropy-based Mix (BEM), a +pioneering mixing approach to re-balance the class distribution of both data +quantity and uncertainty. Specifically, we first propose a class balanced mix +bank to store data of each class for mixing. This bank samples data based on +the estimated quantity distribution, thus re-balancing data quantity. Then, we +present an entropy-based learning approach to re-balance class-wise +uncertainty, including entropy-based sampling strategy, entropy-based selection +module, and entropy-based class balanced loss. Our BEM first leverages data +mixing for improving LTSSL, and it can also serve as a complement to the +existing re-balancing methods. 
Experimental results show that BEM significantly +enhances various LTSSL frameworks and achieves state-of-the-art performances +across multiple benchmarks.",cs.CV,"['cs.CV', 'cs.LG']" +Relightable Gaussian Codec Avatars,Shunsuke Saito · Gabriel Schwartz · Tomas Simon · Junxuan Li · Giljoo Nam, ,https://arxiv.org/abs/2312.03704,,2312.03704.pdf,Relightable Gaussian Codec Avatars,"The fidelity of relighting is bounded by both geometry and appearance +representations. For geometry, both mesh and volumetric approaches have +difficulty modeling intricate structures like 3D hair geometry. For appearance, +existing relighting models are limited in fidelity and often too slow to render +in real-time with high-resolution continuous environments. In this work, we +present Relightable Gaussian Codec Avatars, a method to build high-fidelity +relightable head avatars that can be animated to generate novel expressions. +Our geometry model based on 3D Gaussians can capture 3D-consistent +sub-millimeter details such as hair strands and pores on dynamic face +sequences. To support diverse materials of human heads such as the eyes, skin, +and hair in a unified manner, we present a novel relightable appearance model +based on learnable radiance transfer. Together with global illumination-aware +spherical harmonics for the diffuse components, we achieve real-time relighting +with all-frequency reflections using spherical Gaussians. This appearance model +can be efficiently relit under both point light and continuous illumination. We +further improve the fidelity of eye reflections and enable explicit gaze +control by introducing relightable explicit eye models. Our method outperforms +existing approaches without compromising real-time performance. We also +demonstrate real-time relighting of avatars on a tethered consumer VR headset, +showcasing the efficiency and fidelity of our avatars.",cs.GR,"['cs.GR', 'cs.CV']" +4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations,Wenbo Wang · Hsuan-I Ho · Chen Guo · Boxiang Rong · Artur Grigorev · Jie Song · Juan Jose Zarate · Otmar Hilliges,https://ait.ethz.ch/4d-dress,https://arxiv.org/abs/2404.18630,,2404.18630.pdf,4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations,"The studies of human clothing for digital avatars have predominantly relied +on synthetic datasets. While easy to collect, synthetic data often fall short +in realism and fail to capture authentic clothing dynamics. Addressing this +gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human +clothing research with its high-quality 4D textured scans and garment meshes. +4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k +textured scans. Creating a real-world clothing dataset is challenging, +particularly in annotating and segmenting the extensive and complex 4D human +scans. To address this, we develop a semi-automatic 4D human parsing pipeline. +We efficiently combine a human-in-the-loop process with automation to +accurately label 4D scans in diverse garments and body movements. Leveraging +precise annotations and high-quality garment meshes, we establish several +benchmarks for clothing simulation and reconstruction. 4D-DRESS offers +realistic and challenging data that complements synthetic sources, paving the +way for advancements in research of lifelike human clothing. 
Website: +https://ait.ethz.ch/4d-dress.",cs.CV,['cs.CV'] +DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback,Yangyi Chen · Karan Sikka · Michael Cogswell · Heng Ji · Ajay Divakaran, ,https://arxiv.org/abs/2311.10081,,2311.10081.pdf,DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback,"We present DRESS, a large vision language model (LVLM) that innovatively +exploits Natural Language feedback (NLF) from Large Language Models to enhance +its alignment and interactions by addressing two key limitations in the +state-of-the-art LVLMs. First, prior LVLMs generally rely only on the +instruction finetuning stage to enhance alignment with human preferences. +Without incorporating extra feedback, they are still prone to generate +unhelpful, hallucinated, or harmful responses. Second, while the visual +instruction tuning data is generally structured in a multi-turn dialogue +format, the connections and dependencies among consecutive conversational turns +are weak. This reduces the capacity for effective multi-turn interactions. To +tackle these, we propose a novel categorization of the NLF into two key types: +critique and refinement. The critique NLF identifies the strengths and +weaknesses of the responses and is used to align the LVLMs with human +preferences. The refinement NLF offers concrete suggestions for improvement and +is adopted to improve the interaction ability of the LVLMs-- which focuses on +LVLMs' ability to refine responses by incorporating feedback in multi-turn +interactions. To address the non-differentiable nature of NLF, we generalize +conditional reinforcement learning for training. Our experimental results +demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and +harmless (21.03%) responses, and more effectively learn from feedback during +multi-turn interactions compared to SOTA LVMLs.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Gaussian Shadow Casting for Neural Characters,Luis Bolanos · Shih-Yang Su · Helge Rhodin, ,https://arxiv.org/abs/2401.06116v1,,2401.06116v1.pdf,Gaussian Shadow Casting for Neural Characters,"Neural character models can now reconstruct detailed geometry and texture +from video, but they lack explicit shadows and shading, leading to artifacts +when generating novel views and poses or during relighting. It is particularly +difficult to include shadows as they are a global effect and the required +casting of secondary rays is costly. We propose a new shadow model using a +Gaussian density proxy that replaces sampling with a simple analytic formula. +It supports dynamic motion and is tailored for shadow computation, thereby +avoiding the affine projection approximation and sorting required by the +closely related Gaussian splatting. Combined with a deferred neural rendering +model, our Gaussian shadows enable Lambertian shading and shadow casting with +minimal overhead. We demonstrate improved reconstructions, with better +separation of albedo, shading, and shadows in challenging outdoor scenes with +direct sun light and hard shadows. Our method is able to optimize the light +direction without any input from the user. 
As a result, novel poses have fewer +shadow artifacts and relighting in novel scenes is more realistic compared to +the state-of-the-art methods, providing new ways to pose neural characters in +novel environments, increasing their applicability.",cs.CV,['cs.CV'] +CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,Zhi Gao · Yuntao Du. · Xintong Zhang · Xiaojian Ma · Wenjuan Han · Song-Chun Zhu · Qing Li, ,https://arxiv.org/abs/2312.10908,,2312.10908.pdf,CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,"Utilizing large language models (LLMs) to compose off-the-shelf visual tools +represents a promising avenue of research for developing robust visual +assistants capable of addressing diverse visual tasks. However, these methods +often overlook the potential for continual learning, typically by freezing the +utilized tools, thus limiting their adaptation to environments requiring new +knowledge. To tackle this challenge, we propose CLOVA, a Closed-Loop Visual +Assistant, which operates within a framework encompassing inference, +reflection, and learning phases. During the inference phase, LLMs generate +programs and execute corresponding tools to complete assigned tasks. In the +reflection phase, a multimodal global-local reflection scheme analyzes human +feedback to determine which tools require updating. Lastly, the learning phase +employs three flexible approaches to automatically gather training data and +introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to +efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA +surpasses existing tool-usage methods by 5% in visual question answering and +multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image +editing. These results underscore the significance of the continual learning +capability in general visual assistants.",cs.CV,['cs.CV'] +Enhancing the Power of OOD Detection via Sample-Aware Model Selection,Feng Xue · Zi He · Yuan Zhang · Chuanlong Xie · Zhenguo Li · Falong Tan, ,,https://www.youtube.com/watch?v=XNso9qsWxHo,,,,,nan +Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation,Zhekai Du · Xinyao Li · Fengling Li · Ke Lu · Lei Zhu · Jingjing Li,https://github.com/TL-UESTC/DAMP,https://arxiv.org/abs/2403.02899,,2403.02899.pdf,Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation,"Conventional Unsupervised Domain Adaptation (UDA) strives to minimize +distribution discrepancy between domains, which neglects to harness rich +semantics from data and struggles to handle complex domain shifts. A promising +technique is to leverage the knowledge of large-scale pre-trained +vision-language models for more guided adaptation. Despite some endeavors, +current methods often learn textual prompts to embed domain semantics for +source and target domains separately and perform classification within each +domain, limiting cross-domain knowledge transfer. Moreover, prompting only the +language branch lacks flexibility to adapt both modalities dynamically. To +bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit +domain-invariant semantics by mutually aligning visual and textual embeddings. +Specifically, the image contextual information is utilized to prompt the +language branch in a domain-agnostic and instance-conditioned way. Meanwhile, +visual prompts are imposed based on the domain-agnostic textual prompt to +elicit domain-invariant visual embeddings. 
These two branches of prompts are +learned mutually with a cross-attention module and regularized with a +semantic-consistency loss and an instance-discrimination contrastive loss. +Experiments on three UDA benchmarks demonstrate the superiority of DAMP over +state-of-the-art approaches.",cs.AI,['cs.AI'] +DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction,Weiyi Lv · Yuhang Huang · NING Zhang · Ruei-Sung Lin · Mei Han · Dan Zeng,https://diffmot.github.io/,https://arxiv.org/abs/2403.02075,,2403.02075.pdf,DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction,"In Multiple Object Tracking, objects often exhibit non-linear motion of +acceleration and deceleration, with irregular direction changes. +Tacking-by-detection (TBD) trackers with Kalman Filter motion prediction work +well in pedestrian-dominant scenarios but fall short in complex situations when +multiple objects perform non-linear and diverse motion simultaneously. To +tackle the complex non-linear motion, we propose a real-time diffusion-based +MOT approach named DiffMOT. Specifically, for the motion predictor component, +we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It +models the entire distribution of various motion presented by the data as a +whole. It also predicts an individual object's motion conditioning on an +individual's historical motion information. Furthermore, it optimizes the +diffusion process with much fewer sampling steps. As a MOT tracker, the DiffMOT +is real-time at 22.7FPS, and also outperforms the state-of-the-art on +DanceTrack and SportsMOT datasets with $62.3\%$ and $76.2\%$ in HOTA metrics, +respectively. To the best of our knowledge, DiffMOT is the first to introduce a +diffusion probabilistic model into the MOT to tackle non-linear motion +prediction.",cs.CV,['cs.CV'] +Dynamic Support Information Mining for Category-Agnostic Pose Estimation,Pengfei Ren · Yuanyuan Gao · Haifeng Sun · Qi Qi · Jingyu Wang · Jianxin Liao, ,https://arxiv.org/abs/2403.13647,,,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. 
Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +Implicit Event-RGBD Neural SLAM,Delin Qu · Chi Yan · Dong Wang · Jie Yin · Qizhi Chen · Dan Xu · Yiting Zhang · Bin Zhao · Xuelong Li,https://delinqu.github.io/EN-SLAM,https://arxiv.org/abs/2311.11013,,2311.11013.pdf,Implicit Event-RGBD Neural SLAM,"Implicit neural SLAM has achieved remarkable progress recently. Nevertheless, +existing methods face significant challenges in non-ideal scenarios, such as +motion blur or lighting variation, which often leads to issues like convergence +failures, localization drifts, and distorted mapping. To address these +challenges, we propose EN-SLAM, the first event-RGBD implicit neural SLAM +framework, which effectively leverages the high rate and high dynamic range +advantages of event data for tracking and mapping. Specifically, EN-SLAM +proposes a differentiable CRF (Camera Response Function) rendering technique to +generate distinct RGB and event camera data via a shared radiance field, which +is optimized by learning a unified implicit representation with the captured +event and RGBD supervision. Moreover, based on the temporal difference property +of events, we propose a temporal aggregating optimization strategy for the +event joint tracking and global bundle adjustment, capitalizing on the +consecutive difference constraints of events, significantly enhancing tracking +accuracy and robustness. Finally, we construct the simulated dataset +DEV-Indoors and real captured dataset DEV-Reals containing 6 scenes, 17 +sequences with practical motion blur and lighting changes for evaluations. +Experimental results show that our method outperforms the SOTA methods in both +tracking ATE and mapping ACC with a real-time 17 FPS in various challenging +environments. Project page: https://delinqu.github.io/EN-SLAM.",cs.CV,['cs.CV'] +DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting,Demin Yu · Xutao Li · Yunming Ye · Baoquan Zhang · Luo Chuyao · Kuai Dai · wangrui · Chenxunlai, ,https://arxiv.org/abs/2312.06734,,2312.06734.pdf,DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting,"Precipitation nowcasting is an important spatio-temporal prediction task to +predict the radar echoes sequences based on current observations, which can +serve both meteorological science and smart city applications. Due to the +chaotic evolution nature of the precipitation systems, it is a very challenging +problem. Previous studies address the problem either from the perspectives of +deterministic modeling or probabilistic modeling. However, their predictions +suffer from the blurry, high-value echoes fading away and position inaccurate +issues. The root reason of these issues is that the chaotic evolutionary +precipitation systems are not appropriately modeled. Inspired by the nature of +the systems, we propose to decompose and model them from the perspective of +global deterministic motion and local stochastic variations with residual +mechanism. A unified and flexible framework that can equip any type of +spatio-temporal models is proposed based on residual diffusion, which +effectively tackles the shortcomings of previous methods. Extensive +experimental results on four publicly available radar datasets demonstrate the +effectiveness and superiority of the proposed framework, compared to +state-of-the-art techniques. 
Our code is publicly available at +https://github.com/DeminYu98/DiffCast.",cs.CV,['cs.CV'] +Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions,Zeyu Han · Fangrui Zhu · Qianru Lao · Huaizu Jiang, ,https://arxiv.org/abs/2311.17048,,2311.17048.pdf,Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions,"Zero-shot referring expression comprehension aims at localizing bounding +boxes in an image corresponding to provided textual prompts, which requires: +(i) a fine-grained disentanglement of complex visual scene and textual context, +and (ii) a capacity to understand relationships among disentangled entities. +Unfortunately, existing large vision-language alignment (VLA) models, e.g., +CLIP, struggle with both aspects so cannot be directly used for this task. To +mitigate this gap, we leverage large foundation models to disentangle both +images and texts into triplets in the format of (subject, predicate, object). +After that, grounding is accomplished by calculating the structural similarity +matrix between visual and textual triplets with a VLA model, and subsequently +propagate it to an instance-level similarity matrix. Furthermore, to equip VLA +models with the ability of relationship understanding, we design a +triplet-matching objective to fine-tune the VLA models on a collection of +curated dataset containing abundant entity relationships. Experiments +demonstrate that our visual grounding performance increase of up to 19.5% over +the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo +dataset, our zero-shot approach achieves comparable accuracy to the fully +supervised model. Code is available at +https://github.com/Show-han/Zeroshot_REC.",cs.CV,['cs.CV'] +Self-Supervised Multi-Object Tracking with Path Consistency,Zijia Lu · Bing Shuai · Yanbei Chen · Zhenlin Xu · Davide Modolo, ,https://arxiv.org/abs/2404.05136,,2404.05136.pdf,Self-Supervised Multi-Object Tracking with Path Consistency,"In this paper, we propose a novel concept of path consistency to learn robust +object matching without using manual object identity supervision. Our key idea +is that, to track a object through frames, we can obtain multiple different +association results from a model by varying the frames it can observe, i.e., +skipping frames in observation. As the differences in observations do not alter +the identities of objects, the obtained association results should be +consistent. Based on this rationale, we generate multiple observation paths, +each specifying a different set of frames to be skipped, and formulate the Path +Consistency Loss that enforces the association results are consistent across +different observation paths. We use the proposed loss to train our object +matching model with only self-supervision. 
By extensive experiments on three +tracking datasets (MOT17, PersonPath22, KITTI), we demonstrate that our method +outperforms existing unsupervised methods with consistent margins on various +evaluation metrics, and even achieves performance close to supervised methods.",cs.CV,"['cs.CV', 'cs.AI']" +Correcting Diffusion Generation through Resampling,Yujian Liu · Yang Zhang · Tommi Jaakkola · Shiyu Chang, ,https://arxiv.org/abs/2312.06038,,2312.06038.pdf,Correcting Diffusion Generation through Resampling,"Despite diffusion models' superior capabilities in modeling complex +distributions, there are still non-trivial distributional discrepancies between +generated and ground-truth images, which has resulted in several notable +problems in image generation, including missing object errors in text-to-image +generation and low image quality. Existing methods that attempt to address +these problems mostly do not tend to address the fundamental cause behind these +problems, which is the distributional discrepancies, and hence achieve +sub-optimal results. In this paper, we propose a particle filtering framework +that can effectively address both problems by explicitly reducing the +distributional discrepancies. Specifically, our method relies on a set of +external guidance, including a small set of real images and a pre-trained +object detector, to gauge the distribution gap, and then design the resampling +weight accordingly to correct the gap. Experiments show that our methods can +effectively correct missing object errors and improve image quality in various +image generation tasks. Notably, our method outperforms the existing strongest +baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is +publicly available at +https://github.com/UCSB-NLP-Chang/diffusion_resampling.git.",cs.CV,"['cs.CV', 'cs.LG']" +Exploring Orthogonality in Open World Object Detection,Zhicheng Sun · Jinghan Li · Yadong Mu,https://github.com/feifeiobama/OrthogonalDet,,https://www.youtube.com/watch?v=fNDF2pIWbmM,,,,,nan +YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection,Alon Zolfi · Guy AmiT · Amit Baras · Satoru Koda · Ikuya Morikawa · Yuval Elovici · Asaf Shabtai, ,https://arxiv.org/abs/2402.18162,,2402.18162.pdf,Out-of-Distribution Detection using Neural Activation Prior,"Out-of-distribution detection (OOD) is a crucial technique for deploying +machine learning models in the real world to handle the unseen scenarios. In +this paper, we first propose a simple yet effective Neural Activation Prior +(NAP) for OOD detection. Our neural activation prior is based on a key +observation that, for a channel before the global pooling layer of a fully +trained neural network, the probability of a few neurons being activated with a +large response by an in-distribution (ID) sample is significantly higher than +that by an OOD sample. An intuitive explanation is that for a model fully +trained on ID dataset, each channel would play a role in detecting a certain +pattern in the ID dataset, and a few neurons can be activated with a large +response when the pattern is detected in an input sample. Then, a new scoring +function based on this prior is proposed to highlight the role of these +strongly activated neurons in OOD detection. Our approach is plug-and-play and +does not lead to any performance degradation on ID data classification and +requires no extra training or statistics from training or external datasets. 
+Notice that previous methods primarily rely on post-global-pooling features of +the neural networks, while the within-channel distribution information we +leverage would be discarded by the global pooling operator. Consequently, our +method is orthogonal to existing approaches and can be effectively combined +with them in various applications. Experimental results show that our method +achieves the state-of-the-art performance on CIFAR benchmark and ImageNet +dataset, which demonstrates the power of the proposed prior. Finally, we extend +our method to Transformers and the experimental findings indicate that NAP can +also significantly enhance the performance of OOD detection on Transformers, +thereby demonstrating the broad applicability of this prior knowledge.",cs.CV,['cs.CV'] +3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling,Chaokang Jiang · Guangming Wang · Jiuming Liu · Hesheng Wang · Zhuang Ma · Zhenqiang Liu · LIANG · Yi Shan · Dalong Du,https://jiangchaokang.github.io/3DSFLabelling-Page/,https://arxiv.org/abs/2402.18146,,2402.18146.pdf,3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling,"Learning 3D scene flow from LiDAR point clouds presents significant +difficulties, including poor generalization from synthetic datasets to real +scenes, scarcity of real-world 3D labels, and poor performance on real sparse +LiDAR point clouds. We present a novel approach from the perspective of +auto-labelling, aiming to generate a large number of 3D scene flow pseudo +labels for real-world LiDAR point clouds. Specifically, we employ the +assumption of rigid body motion to simulate potential object-level rigid +movements in autonomous driving scenarios. By updating different motion +attributes for multiple anchor boxes, the rigid motion decomposition is +obtained for the whole scene. Furthermore, we developed a novel 3D scene flow +data augmentation method for global and local motion. By perfectly synthesizing +target point clouds based on augmented motion parameters, we easily obtain lots +of 3D scene flow labels in point clouds highly consistent with real scenarios. +On multiple real-world datasets including LiDAR KITTI, nuScenes, and Argoverse, +our method outperforms all previous supervised and unsupervised methods without +requiring manual labelling. Impressively, our method achieves a tenfold +reduction in EPE3D metric on the LiDAR KITTI dataset, reducing it from $0.190m$ +to a mere $0.008m$ error.",cs.CV,['cs.CV'] +Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling,Leon Sick · Dominik Engel · Pedro Hermosilla · Timo Ropinski,https://leonsick.github.io/depthg/,https://arxiv.org/abs/2309.12378,,2309.12378.pdf,Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling,"Traditionally, training neural networks to perform semantic segmentation +required expensive human-made annotations. But more recently, advances in the +field of unsupervised learning have made significant progress on this issue and +towards closing the gap to supervised algorithms. To achieve this, semantic +knowledge is distilled by learning to correlate randomly sampled features from +images across an entire dataset. In this work, we build upon these advances by +incorporating information about the structure of the scene into the training +process through the use of depth information. 
We achieve this by (1) learning +depth-feature correlation by spatially correlate the feature maps with the +depth maps to induce knowledge about the structure of the scene and (2) +implementing farthest-point sampling to more effectively select relevant +features by utilizing 3D sampling techniques on depth information of the scene. +Finally, we demonstrate the effectiveness of our technical contributions +through extensive experimentation and present significant improvements in +performance across multiple benchmark datasets.",cs.CV,['cs.CV'] +Super-Resolution Reconstruction from Bayer-Pattern Spike Streams,Yanchen Dong · Ruiqin Xiong · Jian Zhang · Zhaofei Yu · Xiaopeng Fan · Shuyuan Zhu · Tiejun Huang,https://github.com/csycdong/CSCSR,,https://ojs.aaai.org/index.php/AAAI/article/view/27924,,,,,nan +Random Entangled Tokens for Adversarially Robust Vision Transformer,Huihui Gong · Minjing Dong · Siqi Ma · Seyit Camtepe · Surya Nepal · Chang Xu, ,https://arxiv.org/abs/2402.07183,,2402.07183.pdf,A Random Ensemble of Encrypted Vision Transformers for Adversarially Robust Defense,"Deep neural networks (DNNs) are well known to be vulnerable to adversarial +examples (AEs). In previous studies, the use of models encrypted with a secret +key was demonstrated to be robust against white-box attacks, but not against +black-box ones. In this paper, we propose a novel method using the vision +transformer (ViT) that is a random ensemble of encrypted models for enhancing +robustness against both white-box and black-box attacks. In addition, a +benchmark attack method, called AutoAttack, is applied to models to test +adversarial robustness objectively. In experiments, the method was demonstrated +to be robust against not only white-box attacks but also black-box ones in an +image classification task on the CIFAR-10 and ImageNet datasets. The method was +also compared with the state-of-the-art in a standardized benchmark for +adversarial robustness, RobustBench, and it was verified to outperform +conventional defenses in terms of clean accuracy and robust accuracy.",cs.AI,['cs.AI'] +PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding,Zhen Li · Mingdeng Cao · Xintao Wang · Zhongang Qi · Ming-Ming Cheng · Ying Shan,https://github.com/TencentARC/PhotoMaker,https://arxiv.org/abs/2312.04461,,2312.04461.pdf,PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding,"Recent advances in text-to-image generation have made remarkable progress in +synthesizing realistic human photos conditioned on given text prompts. However, +existing personalized generation methods cannot simultaneously satisfy the +requirements of high efficiency, promising identity (ID) fidelity, and flexible +text controllability. In this work, we introduce PhotoMaker, an efficient +personalized text-to-image generation method, which mainly encodes an arbitrary +number of input ID images into a stack ID embedding for preserving ID +information. Such an embedding, serving as a unified ID representation, can not +only encapsulate the characteristics of the same input ID comprehensively, but +also accommodate the characteristics of different IDs for subsequent +integration. This paves the way for more intriguing and practically valuable +applications. Besides, to drive the training of our PhotoMaker, we propose an +ID-oriented data construction pipeline to assemble the training data. 
Under the +nourishment of the dataset constructed through the proposed pipeline, our +PhotoMaker demonstrates better ID preservation ability than test-time +fine-tuning based methods, yet provides significant speed improvements, +high-quality generation results, strong generalization capabilities, and a wide +range of applications. Our project page is available at +https://photo-maker.github.io/",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +Hierarchical Correlation Clustering and Tree Preserving Embedding,Morteza Haghir Chehreghani · Mostafa Haghir Chehreghani, ,https://arxiv.org/abs/2402.03587,,2402.03587.pdf,Information-Theoretic Active Correlation Clustering,"We study correlation clustering where the pairwise similarities are not known +in advance. For this purpose, we employ active learning to query pairwise +similarities in a cost-efficient way. We propose a number of effective +information-theoretic acquisition functions based on entropy and information +gain. We extensively investigate the performance of our methods in different +settings and demonstrate their superior performance compared to the +alternatives.",cs.LG,"['cs.LG', 'stat.ML']" +Referring Image Editing: Object-level Image Editing via Referring Expressions,Chang Liu · Xiangtai Li · Henghui Ding, ,,https://link.springer.com/article/10.1007/s11063-024-11487-2,,,,,nan +CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image,Donggeun Yoon · Donghyeon Cho, ,https://arxiv.org/abs/2310.08092,,2310.08092.pdf,Consistent123: Improve Consistency for One Image to 3D Object Synthesis,"Large image diffusion models enable novel view synthesis with high quality +and excellent zero-shot capability. However, such models based on +image-to-image translation have no guarantee of view consistency, limiting the +performance for downstream tasks like 3D reconstruction and image-to-3D +generation. To empower consistency, we propose Consistent123 to synthesize +novel views simultaneously by incorporating additional cross-view attention +layers and the shared self-attention mechanism. The proposed attention +mechanism improves the interaction across all synthesized views, as well as the +alignment between the condition view and novel views. In the sampling stage, +such architecture supports simultaneously generating an arbitrary number of +views while training at a fixed length. We also introduce a progressive +classifier-free guidance strategy to achieve the trade-off between texture and +geometry for synthesized object views. Qualitative and quantitative experiments +show that Consistent123 outperforms baselines in view consistency by a large +margin. Furthermore, we demonstrate a significant improvement of Consistent123 +on varying downstream tasks, showing its great potential in the 3D generation +field. The project page is available at consistent-123.github.io.",cs.CV,['cs.CV'] +WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models,Changhoon Kim · Kyle Min · Maitreya Patel · Sheng Cheng · 'YZ' Yezhou Yang, ,https://arxiv.org/abs/2306.04744,,2306.04744.pdf,WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models,"The rapid advancement of generative models, facilitating the creation of +hyper-realistic images from textual descriptions, has concurrently escalated +critical societal concerns such as misinformation. 
Although providing some +mitigation, traditional fingerprinting mechanisms fall short in attributing +responsibility for the malicious use of synthetic images. This paper introduces +a novel approach to model fingerprinting that assigns responsibility for the +generated images, thereby serving as a potential countermeasure to model +misuse. Our method modifies generative models based on each user's unique +digital fingerprint, imprinting a unique identifier onto the resultant content +that can be traced back to the user. This approach, incorporating fine-tuning +into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates +near-perfect attribution accuracy with a minimal impact on output quality. +Through extensive evaluation, we show that our method outperforms baseline +methods with an average improvement of 11\% in handling image post-processes. +Our method presents a promising and novel avenue for accountable model +distribution and responsible use. Our code is available in +\url{https://github.com/kylemin/WOUAF}.",cs.CV,['cs.CV'] +Improving Unsupervised Hierarchical Representation with Reinforcement Learning,Ruyi An · Yewen Li · Xu He · Pengjie Gu · Mengchen Zhao · Dong Li · Jianye Hao · Bo An · Chaojie Wang · Mingyuan Zhou, ,,https://www2.scut.edu.cn/sse/2024/0226/c16789a534834/page.htm,,,,,nan +Joint-Task Regularization for Partially Labeled Multi-Task Learning,Kento Nishi · Junsik Kim · Wanhua Li · Hanspeter Pfister,https://kentonishi.com/JTR-CVPR-2024/,https://arxiv.org/abs/2404.01976,,2404.01976.pdf,Joint-Task Regularization for Partially Labeled Multi-Task Learning,"Multi-task learning has become increasingly popular in the machine learning +field, but its practicality is hindered by the need for large, labeled +datasets. Most multi-task learning methods depend on fully labeled datasets +wherein each input example is accompanied by ground-truth labels for all target +tasks. Unfortunately, curating such datasets can be prohibitively expensive and +impractical, especially for dense prediction tasks which require per-pixel +labels for each image. With this in mind, we propose Joint-Task Regularization +(JTR), an intuitive technique which leverages cross-task relations to +simultaneously regularize all tasks in a single joint-task latent space to +improve learning when data is not fully labeled for all tasks. JTR stands out +from existing approaches in that it regularizes all tasks jointly rather than +separately in pairs -- therefore, it achieves linear complexity relative to the +number of tasks while previous methods scale quadratically. To demonstrate the +validity of our approach, we extensively benchmark our method across a wide +variety of partially labeled scenarios based on NYU-v2, Cityscapes, and +Taskonomy.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations,Peng Dai · Yang Zhang · Tao Liu · ZhenFan · Tianyuan Du · Zhuo Su · Xiaozheng Zheng · Zeming Li,https://pico-ai-team.github.io/hmd-poser,https://arxiv.org/abs/2403.03561,,2403.03561.pdf,HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations,"It is especially challenging to achieve real-time human motion tracking on a +standalone VR Head-Mounted Display (HMD) such as Meta Quest and PICO. In this +paper, we propose HMD-Poser, the first unified approach to recover full-body +motions using scalable sparse observations from HMD and body-worn IMUs. 
In +particular, it can support a variety of input scenarios, such as HMD, +HMD+2IMUs, HMD+3IMUs, etc. The scalability of inputs may accommodate users' +choices for both high tracking accuracy and easy-to-wear. A lightweight +temporal-spatial feature learning network is proposed in HMD-Poser to guarantee +that the model runs in real-time on HMDs. Furthermore, HMD-Poser presents +online body shape estimation to improve the position accuracy of body joints. +Extensive experimental results on the challenging AMASS dataset show that +HMD-Poser achieves new state-of-the-art results in both accuracy and real-time +performance. We also build a new free-dancing motion dataset to evaluate +HMD-Poser's on-device performance and investigate the performance gap between +synthetic data and real-captured sensor data. Finally, we demonstrate our +HMD-Poser with a real-time Avatar-driving application on a commercial HMD. Our +code and free-dancing motion dataset are available +https://pico-ai-team.github.io/hmd-poser",cs.CV,['cs.CV'] +BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model,song yiran · Qianyu Zhou · Xiangtai Li · Deng-Ping Fan · Xuequan Lu · Lizhuang Ma, ,https://arxiv.org/abs/2401.02317,,2401.02317.pdf,BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model,"In this paper, we address the challenge of image resolution variation for the +Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, +exhibits a performance degradation when faced with datasets with varying image +sizes. Previous approaches tend to resize the image to a fixed size or adopt +structure modifications, hindering the preservation of SAM's rich prior +knowledge. Besides, such task-specific tuning necessitates a complete +retraining of the model, which is cost-expensive and unacceptable for +deployment in the downstream tasks. In this paper, we reformulate this issue as +a length extrapolation problem, where token sequence length varies while +maintaining a consistent patch size for images of different sizes. To this end, +we propose Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's +adaptability to varying image resolutions while eliminating the need for +structure modifications. Firstly, we introduce a new scaling factor to ensure +consistent magnitude in the attention layer's dot product values when the token +sequence length changes. Secondly, we present a bias-mode attention mask that +allows each token to prioritize neighboring information, mitigating the impact +of untrained distant information. Our BA-SAM demonstrates efficacy in two +scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, +including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to +significantly mitigate performance degradation in the zero-shot setting and +achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we +propose a generalized model and benchmark, showcasing BA-SAM's generalizability +across all four datasets simultaneously. 
Code is available at +https://github.com/zongzi13545329/BA-SAM",cs.CV,['cs.CV'] +CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection,Jiayi Zhu · Qing Guo · Felix Juefei Xu · Yihao Huang · Yang Liu · Geguang Pu, ,https://arxiv.org/abs/2403.18554,,2403.18554.pdf,CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection,"Co-salient object detection (CoSOD) aims to identify the common and salient +(usually in the foreground) regions across a given group of images. Although +achieving significant progress, state-of-the-art CoSODs could be easily +affected by some adversarial perturbations, leading to substantial accuracy +reduction. The adversarial perturbations can mislead CoSODs but do not change +the high-level semantic information (e.g., concept) of the co-salient objects. +In this paper, we propose a novel robustness enhancement framework by first +learning the concept of the co-salient objects based on the input group images +and then leveraging this concept to purify adversarial perturbations, which are +subsequently fed to CoSODs for robustness enhancement. Specifically, we propose +CosalPure containing two modules, i.e., group-image concept learning and +concept-guided diffusion purification. For the first module, we adopt a +pre-trained text-to-image diffusion model to learn the concept of co-salient +objects within group images where the learned concept is robust to adversarial +examples. For the second module, we map the adversarial image to the latent +space and then perform diffusion generation by embedding the learned concept +into the noise prediction function as an extra condition. Our method can +effectively alleviate the influence of the SOTA adversarial attack containing +different adversarial patterns, including exposure and noise. The extensive +results demonstrate that our method could enhance the robustness of CoSODs +significantly.",cs.CV,['cs.CV'] +Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation,Hanyang Chi · Jian Pang · Bingfeng Zhang · Weifeng Liu, ,https://arxiv.org/abs/2405.00378,,2405.00378.pdf,Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation,"Consistency learning is a central strategy to tackle unlabeled data in +semi-supervised medical image segmentation (SSMIS), which enforces the model to +produce consistent predictions under the perturbation. However, most current +approaches solely focus on utilizing a specific single perturbation, which can +only cope with limited cases, while employing multiple perturbations +simultaneously is hard to guarantee the quality of consistency learning. In +this paper, we propose an Adaptive Bidirectional Displacement (ABD) approach to +solve the above challenge. Specifically, we first design a bidirectional patch +displacement based on reliable prediction confidence for unlabeled data to +generate new samples, which can effectively suppress uncontrollable regions and +still retain the influence of input perturbations. Meanwhile, to enforce the +model to learn the potentially uncontrollable content, a bidirectional +displacement operation with inverse confidence is proposed for the labeled +images, which generates samples with more unreliable information to facilitate +model learning. Extensive experiments show that ABD achieves new +state-of-the-art performances for SSMIS, significantly improving different +baselines. 
Source code is available at https://github.com/chy-upc/ABD.",cs.CV,['cs.CV'] +UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather,Haimei Zhao · Jing Zhang · Zhuo Chen · Shanshan Zhao · Dacheng Tao, ,https://arxiv.org/abs/2404.05145,,2404.05145.pdf,UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather,"LiDAR semantic segmentation (LSS) is a critical task in autonomous driving +and has achieved promising progress. However, prior LSS methods are +conventionally investigated and evaluated on datasets within the same domain in +clear weather. The robustness of LSS models in unseen scenes and all weather +conditions is crucial for ensuring safety and reliability in real applications. +To this end, we propose UniMix, a universal method that enhances the +adaptability and generalizability of LSS models. UniMix first leverages +physically valid adverse weather simulation to construct a Bridge Domain, which +serves to bridge the domain gap between the clear weather scenes and the +adverse weather scenes. Then, a Universal Mixing operator is defined regarding +spatial, intensity, and semantic distributions to create the intermediate +domain with mixed samples from given domains. Integrating the proposed two +techniques into a teacher-student framework, UniMix efficiently mitigates the +domain gap and enables LSS models to learn weather-robust and domain-invariant +representations. We devote UniMix to two main setups: 1) unsupervised domain +adaption, adapting the model from the clear weather source domain to the +adverse weather target domain; 2) domain generalization, learning a model that +generalizes well to unseen scenes in adverse weather. Extensive experiments +validate the effectiveness of UniMix across different tasks and datasets, all +achieving superior performance over state-of-the-art methods. The code will be +released.",cs.CV,['cs.CV'] +Estimating Extreme 3D Image Rotations using Cascaded Attention,Shay Dekel · Yosi Keller · Martin Čadík, ,,https://www.youtube.com/watch?v=LzUPefef_8Q,,,,,nan +PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics,Tianyi Xie · Zeshun Zong · Yuxing Qiu · Xuan Li · Yutao Feng · Yin Yang · Chenfanfu Jiang, ,https://arxiv.org/abs/2311.12198,,2311.12198.pdf,PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics,"We introduce PhysGaussian, a new method that seamlessly integrates physically +grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel +motion synthesis. Employing a custom Material Point Method (MPM), our approach +enriches 3D Gaussian kernels with physically meaningful kinematic deformation +and mechanical stress attributes, all evolved in line with continuum mechanics +principles. A defining characteristic of our method is the seamless integration +between physical simulation and visual rendering: both components utilize the +same 3D Gaussian kernels as their discrete representations. This negates the +necessity for triangle/tetrahedron meshing, marching cubes, ""cage meshes,"" or +any other geometry embedding, highlighting the principle of ""what you see is +what you simulate (WS$^2$)."" Our method demonstrates exceptional versatility +across a wide variety of materials--including elastic entities, metals, +non-Newtonian fluids, and granular materials--showcasing its strong +capabilities in creating diverse visual content with novel viewpoints and +movements. 
Our project page is at: https://xpandora.github.io/PhysGaussian/",cs.GR,"['cs.GR', 'cs.AI', 'cs.CV', 'cs.LG']" +RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding,Jihan Yang · Runyu Ding · Weipeng DENG · Zhe Wang · Xiaojuan Qi, ,https://arxiv.org/abs/2308.00353,,2308.00353.pdf,Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding,"Open-world instance-level scene understanding aims to locate and recognize +unseen object categories that are not present in the annotated dataset. This +task is challenging because the model needs to both localize novel 3D objects +and infer their semantic categories. A key factor for the recent progress in 2D +open-world perception is the availability of large-scale image-text pairs from +the Internet, which cover a wide range of vocabulary concepts. However, this +success is hard to replicate in 3D scenarios due to the scarcity of 3D-text +pairs. To address this challenge, we propose to harness pre-trained +vision-language (VL) foundation models that encode extensive knowledge from +image-text pairs to generate captions for multi-view images of 3D scenes. This +allows us to establish explicit associations between 3D shapes and +semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic +representation learning from captions for object-level categorization, we +design hierarchical point-caption association methods to learn semantic-aware +embeddings that exploit the 3D geometry between 3D points and multi-view +images. In addition, to tackle the localization challenge for novel classes in +the open-world setting, we develop debiased instance localization, which +involves training object grouping modules on unlabeled data using +instance-level pseudo supervision. This significantly improves the +generalization capabilities of instance grouping and thus the ability to +accurately locate novel objects. We conduct extensive experiments on 3D +semantic, instance, and panoptic segmentation tasks, covering indoor and +outdoor scenes across three datasets. Our method outperforms baseline methods +by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), +instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. +14.7%$\sim$43.3%). Code will be available.",cs.CV,['cs.CV'] +Modality-Collaborative Test-Time Adaptation for Action Recognition,Baochen Xiong · Xiaoshan Yang · Yaguang Song · Yaowei Wang · Changsheng Xu, ,,https://dl.acm.org/doi/pdf/10.1145/3581783.3611757,,,,,nan +3D Human Pose Perception from Egocentric Stereo Videos,Hiroyasu Akada · Jian Wang · Vladislav Golyanik · Christian Theobalt, ,https://arxiv.org/abs/2401.00889,,2401.00889.pdf,3D Human Pose Perception from Egocentric Stereo Videos,"While head-mounted devices are becoming more compact, they provide egocentric +views with significant self-occlusions of the device user. Hence, existing +methods often fail to accurately estimate complex 3D poses from egocentric +views. In this work, we propose a new transformer-based framework to improve +egocentric stereo 3D human pose estimation, which leverages the scene +information and temporal context of egocentric stereo videos. Specifically, we +utilize 1) depth features from our 3D scene reconstruction module with +uniformly sampled windows of egocentric stereo frames, and 2) human joint +queries enhanced by temporal features of the video inputs. 
Our method is able +to accurately estimate human poses even in challenging scenarios, such as +crouching and sitting. Furthermore, we introduce two new benchmark datasets, +i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a +much larger number of egocentric stereo views with a wider variety of human +motions than the existing datasets, allowing comprehensive evaluation of +existing and upcoming methods. Our extensive experiments show that the proposed +approach significantly outperforms previous methods. We will release +UnrealEgo2, UnrealEgo-RW, and trained models on our project page.",cs.CV,['cs.CV'] +Deep Generative Model based Rate-Distortion for Image Downscaling Assessment,yuanbang liang · Bhavesh Garg · Paul L. Rosin · Yipeng Qin, ,https://arxiv.org/abs/2403.15139,,2403.15139.pdf,Deep Generative Model based Rate-Distortion for Image Downscaling Assessment,"In this paper, we propose Image Downscaling Assessment by Rate-Distortion +(IDA-RD), a novel measure to quantitatively evaluate image downscaling +algorithms. In contrast to image-based methods that measure the quality of +downscaled images, ours is process-based that draws ideas from rate-distortion +theory to measure the distortion incurred during downscaling. Our main idea is +that downscaling and super-resolution (SR) can be viewed as the encoding and +decoding processes in the rate-distortion model, respectively, and that a +downscaling algorithm that preserves more details in the resulting +low-resolution (LR) images should lead to less distorted high-resolution (HR) +images in SR. In other words, the distortion should increase as the downscaling +algorithm deteriorates. However, it is non-trivial to measure this distortion +as it requires the SR algorithm to be blind and stochastic. Our key insight is +that such requirements can be met by recent SR algorithms based on deep +generative models that can find all matching HR images for a given LR image on +their learned image manifolds. Extensive experimental results show the +effectiveness of our IDA-RD measure.",cs.CV,"['cs.CV', 'eess.IV']" +Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?,Zhiqi Li · Zhiding Yu · Shiyi Lan · Jiahan Li · Jan Kautz · Tong Lu · Jose M. Alvarez, ,https://arxiv.org/abs/2312.03031,,2312.03031.pdf,Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?,"End-to-end autonomous driving recently emerged as a promising research +direction to target autonomy from a full-stack perspective. Along this line, +many of the latest works follow an open-loop evaluation setting on nuScenes to +study the planning behavior. In this paper, we delve deeper into the problem by +conducting thorough analyses and demystifying more devils in the details. We +initially observed that the nuScenes dataset, characterized by relatively +simple driving scenarios, leads to an under-utilization of perception +information in end-to-end models incorporating ego status, such as the ego +vehicle's velocity. These models tend to rely predominantly on the ego +vehicle's status for future path planning. Beyond the limitations of the +dataset, we also note that current metrics do not comprehensively assess the +planning quality, leading to potentially biased conclusions drawn from existing +benchmarks. To address this issue, we introduce a new metric to evaluate +whether the predicted trajectories adhere to the road. 
We further propose a +simple baseline able to achieve competitive results without relying on +perception annotations. Given the current limitations on the benchmark and +metrics, we suggest the community reassess relevant prevailing research and be +cautious whether the continued pursuit of state-of-the-art would yield +convincing and universal conclusions. Code and models are available at +\url{https://github.com/NVlabs/BEV-Planner}",cs.CV,['cs.CV'] +FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding,Jun Xiang · Xuan Gao · Yudong Guo · Juyong Zhang, ,https://arxiv.org/abs/2312.02214,,2312.02214.pdf,FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding,"We propose FlashAvatar, a novel and lightweight 3D animatable avatar +representation that could reconstruct a digital avatar from a short monocular +video sequence in minutes and render high-fidelity photo-realistic images at +300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D +Gaussian field embedded in the surface of a parametric face model and learn +extra spatial offset to model non-surface regions and subtle facial details. +While full use of geometric priors can capture high-frequency facial details +and preserve exaggerated expressions, proper initialization can help reduce the +number of Gaussians, thus enabling super-fast rendering speed. Extensive +experimental results demonstrate that FlashAvatar outperforms existing works +regarding visual quality and personalized details and is almost an order of +magnitude faster in rendering speed. Project page: +https://ustc3dv.github.io/FlashAvatar/",cs.CV,"['cs.CV', 'cs.GR']" +The Manga Whisperer: Automatically Generating Transcriptions for Comics,Ragav Sachdeva · Andrew Zisserman,https://github.com/ragavsachdeva/magi,https://arxiv.org/abs/2401.10224,,2401.10224.pdf,The Manga Whisperer: Automatically Generating Transcriptions for Comics,"In the past few decades, Japanese comics, commonly referred to as Manga, have +transcended both cultural and linguistic boundaries to become a true worldwide +sensation. Yet, the inherent reliance on visual cues and illustration within +manga renders it largely inaccessible to individuals with visual impairments. +In this work, we seek to address this substantial barrier, with the aim of +ensuring that manga can be appreciated and actively engaged by everyone. +Specifically, we tackle the problem of diarisation i.e. generating a +transcription of who said what and when, in a fully automatic way. + To this end, we make the following contributions: (1) we present a unified +model, Magi, that is able to (a) detect panels, text boxes and character boxes, +(b) cluster characters by identity (without knowing the number of clusters +apriori), and (c) associate dialogues to their speakers; (2) we propose a novel +approach that is able to sort the detected text boxes in their reading order +and generate a dialogue transcript; (3) we annotate an evaluation benchmark for +this task using publicly available [English] manga pages. 
The code, evaluation +datasets and the pre-trained model can be found at: +https://github.com/ragavsachdeva/magi.",cs.CV,['cs.CV'] +SNIDA: Unlocking Few-Shot Object Detection with Non-linear Semantic Decoupling Augmentation,Yanjie Wang · Xu Zou · Luxin Yan · Sheng Zhong · Jiahuan Zhou, ,https://arxiv.org/abs/2401.11140,,2401.11140.pdf,Stability Plasticity Decoupled Fine-tuning For Few-shot end-to-end Object Detection,"Few-shot object detection(FSOD) aims to design methods to adapt object +detectors efficiently with only few annotated samples. Fine-tuning has been +shown to be an effective and practical approach. However, previous works often +take the classical base-novel two stage fine-tuning procedure but ignore the +implicit stability-plasticity contradiction among different modules. +Specifically, the random re-initialized classifiers need more plasticity to +adapt to novel samples. The other modules inheriting pre-trained weights demand +more stability to reserve their class-agnostic knowledge. Regular fine-tuning +which couples the optimization of these two parts hurts the model +generalization in FSOD scenarios. In this paper, we find that this problem is +prominent in the end-to-end object detector Sparse R-CNN for its +multi-classifier cascaded architecture. We propose to mitigate this +contradiction by a new three-stage fine-tuning procedure by introducing an +addtional plasticity classifier fine-tuning(PCF) stage. We further design the +multi-source ensemble(ME) technique to enhance the generalization of the model +in the final fine-tuning stage. Extensive experiments verify that our method is +effective in regularizing Sparse R-CNN, outperforming previous methods in the +FSOD benchmark.",cs.CV,"['cs.CV', 'cs.AI']" +Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement,Xiuquan Hou · Meiqin Liu · Senlin Zhang · Ping Wei · Badong Chen,https://github.com/xiuqhou/Salience-DETR,https://arxiv.org/abs/2403.16131,,2403.16131.pdf,Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement,"DETR-like methods have significantly increased detection performance in an +end-to-end manner. The mainstream two-stage frameworks of them perform dense +self-attention and select a fraction of queries for sparse cross-attention, +which is proven effective for improving performance but also introduces a heavy +computational burden and high dependence on stable query selection. This paper +demonstrates that suboptimal two-stage selection strategies result in scale +bias and redundancy due to the mismatch between selected queries and objects in +two-stage initialization. To address these issues, we propose hierarchical +salience filtering refinement, which performs transformer encoding only on +filtered discriminative queries, for a better trade-off between computational +efficiency and precision. The filtering process overcomes scale bias through a +novel scale-independent salience supervision. To compensate for the semantic +misalignment among queries, we introduce elaborate query refinement modules for +stable two-stage initialization. Based on above improvements, the proposed +Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP +on three challenging task-specific detection datasets, as well as 49.2% AP on +COCO 2017 with less FLOPs. 
The code is available at +https://github.com/xiuqhou/Salience-DETR.",cs.CV,['cs.CV'] +One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion,Minghua Liu · Ruoxi Shi · Linghao Chen · Zhuoyang Zhang · Chao Xu · Xinyue Wei · Hansheng Chen · Chong Zeng · Jiayuan Gu · Hao Su,https://sudo-ai-3d.github.io/One2345plus_page/,,https://github.com/SUDO-AI-3D/One2345plus,,,,,nan +Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving,Junhao Zheng · Chenhao Lin · Jiahao Sun · Zhengyu Zhao · Qian Li · Chao Shen,https://github.com/gandolfczjh/3d2fool,https://arxiv.org/abs/2403.17301,,2403.17301.pdf,Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving,"Deep learning-based monocular depth estimation (MDE), extensively applied in +autonomous driving, is known to be vulnerable to adversarial attacks. Previous +physical attacks against MDE models rely on 2D adversarial patches, so they +only affect a small, localized region in the MDE map but fail under various +viewpoints. To address these limitations, we propose 3D Depth Fool +(3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models. +3D$^2$Fool is specifically optimized to generate 3D adversarial textures +agnostic to model types of vehicles and to have improved robustness in bad +weather conditions, such as rain and fog. Experimental results validate the +superior performance of our 3D$^2$Fool across various scenarios, including +vehicles, MDE models, weather conditions, and viewpoints. Real-world +experiments with printed 3D textures on physical vehicle models further +demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.",cs.CV,"['cs.CV', 'cs.CR']" +VecFusion: Vector Font Generation with Diffusion,Vikas Thamizharasan · Difan Liu · Shantanu Agarwal · Matthew Fisher · Michaël Gharbi · Oliver Wang · Alec Jacobson · Evangelos Kalogerakis, ,https://arxiv.org/abs/2312.10540,,2312.10540.pdf,VecFusion: Vector Font Generation with Diffusion,"We present VecFusion, a new neural architecture that can generate vector +fonts with varying topological structures and precise control point positions. +Our approach is a cascaded diffusion model which consists of a raster diffusion +model followed by a vector diffusion model. The raster model generates +low-resolution, rasterized fonts with auxiliary control point information, +capturing the global style and shape of the font, while the vector model +synthesizes vector fonts conditioned on the low-resolution raster fonts from +the first stage. To synthesize long and complex curves, our vector diffusion +model uses a transformer architecture and a novel vector representation that +enables the modeling of diverse vector geometry and the precise prediction of +control points. 
Our experiments show that, in contrast to previous generative +models for vector graphics, our new cascaded vector diffusion model generates +higher quality vector fonts, with complex structures and diverse styles.",cs.CV,"['cs.CV', 'cs.GR']" +LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection,Dat NGUYEN · Nesryne Mejri · Inder Pal Singh · Polina Kuleshova · Marcella Astrid · Anis Kacem · Enjie Ghorbel · Djamila Aouada,https://github.com/10Ring/LAA-Net,https://arxiv.org/abs/2401.13856,,2401.13856.pdf,LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection,"This paper introduces a novel approach for high-quality deepfake detection +called Localized Artifact Attention Network (LAA-Net). Existing methods for +high-quality deepfake detection are mainly based on a supervised binary +classifier coupled with an implicit attention mechanism. As a result, they do +not generalize well to unseen manipulations. To handle this issue, two main +contributions are made. First, an explicit attention mechanism within a +multi-task learning framework is proposed. By combining heatmap-based and +self-consistency attention strategies, LAA-Net is forced to focus on a few +small artifact-prone vulnerable regions. Second, an Enhanced Feature Pyramid +Network (E-FPN) is proposed as a simple and effective mechanism for spreading +discriminative low-level features into the final feature output, with the +advantage of limiting redundancy. Experiments performed on several benchmarks +show the superiority of our approach in terms of Area Under the Curve (AUC) and +Average Precision (AP). The code is available at +https://github.com/10Ring/LAA-Net.",cs.CV,['cs.CV'] +SAI3D: Segment Any Instance in 3D Scenes,Yingda Yin · Yuzheng Liu · Yang Xiao · Daniel Cohen-Or · Jingwei Huang · Baoquan Chen,https://yd-yin.github.io/SAI3D/,https://arxiv.org/abs/2312.11557,,,SAI3D: Segment Any Instance in 3D Scenes,"Advancements in 3D instance segmentation have traditionally been tethered to +the availability of annotated datasets, limiting their application to a narrow +spectrum of object categories. Recent efforts have sought to harness +vision-language models like CLIP for open-set semantic reasoning, yet these +methods struggle to distinguish between objects of the same categories and rely +on specific prompts that are not universally applicable. In this paper, we +introduce SAI3D, a novel zero-shot 3D instance segmentation approach that +synergistically leverages geometric priors and semantic cues derived from +Segment Anything Model (SAM). Our method partitions a 3D scene into geometric +primitives, which are then progressively merged into 3D instance segmentations +that are consistent with the multi-view SAM masks. Moreover, we design a +hierarchical region-growing algorithm with a dynamic thresholding mechanism, +which largely improves the robustness of finegrained 3D scene parsing.Empirical +evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ +datasets demonstrate the superiority of our approach. Notably, SAI3D +outperforms existing open-vocabulary baselines and even surpasses +fully-supervised methods in class-agnostic segmentation on ScanNet++. 
Our +project page is at https://yd-yin.github.io/SAI3D.",cs.CV,['cs.CV'] +InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models,Jiun Tian Hoe · Xudong Jiang · Chee Seng Chan · Yap-peng Tan · Weipeng Hu,https://jiuntian.github.io/interactdiffusion/,https://arxiv.org/abs/2312.05849,,2312.05849.pdf,InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models,"Large-scale text-to-image (T2I) diffusion models have showcased incredible +capabilities in generating coherent images based on textual descriptions, +enabling vast applications in content generation. While recent advancements +have introduced control over factors such as object localization, posture, and +image contours, a crucial gap remains in our ability to control the +interactions between objects in the generated content. Well-controlling +interactions in generated images could yield meaningful applications, such as +creating realistic scenes with interacting characters. In this work, we study +the problems of conditioning T2I diffusion models with Human-Object Interaction +(HOI) information, consisting of a triplet label (person, action, object) and +corresponding bounding boxes. We propose a pluggable interaction control model, +called InteractDiffusion that extends existing pre-trained T2I diffusion models +to enable them being better conditioned on interactions. Specifically, we +tokenize the HOI information and learn their relationships via interaction +embeddings. A conditioning self-attention layer is trained to map HOI tokens to +visual tokens, thereby conditioning the visual tokens better in existing T2I +diffusion models. Our model attains the ability to control the interaction and +location on existing T2I diffusion models, which outperforms existing baselines +by a large margin in HOI detection score, as well as fidelity in FID and KID. +Project page: https://jiuntian.github.io/interactdiffusion.",cs.CV,"['cs.CV', 'cs.GR', 'cs.MM']" +G3DR: Generative 3D Reconstruction in ImageNet,Pradyumna Reddy · Ismail Elezi · Jiankang Deng,https://preddy5.github.io/g3dr_website/,https://arxiv.org/abs/2403.00939,,2403.00939.pdf,G3DR: Generative 3D Reconstruction in ImageNet,"We introduce a novel 3D generative method, Generative 3D Reconstruction +(G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects +from single images, addressing the limitations of existing methods. At the +heart of our framework is a novel depth regularization technique that enables +the generation of scenes with high-geometric fidelity. G3DR also leverages a +pretrained language-vision model, such as CLIP, to enable reconstruction in +novel views and improve the visual realism of generations. Additionally, G3DR +designs a simple but effective sampling procedure to further improve the +quality of generations. G3DR offers diverse and efficient 3D asset generation +based on class or text conditioning. Despite its simplicity, G3DR is able to +beat state-of-theart methods, improving over them by up to 22% in perceptual +metrics and 90% in geometry scores, while needing only half of the training +time. 
Code is available at https://github.com/preddy5/G3DR",cs.CV,"['cs.CV', 'cs.GR']" +ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining,Ruoxi Shi · Xinyue Wei · Cheng Wang · Hao Su, ,https://arxiv.org/abs/2312.09249,,2312.09249.pdf,ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining,"We present ZeroRF, a novel per-scene optimization method addressing the +challenge of sparse view 360{\deg} reconstruction in neural field +representations. Current breakthroughs like Neural Radiance Fields (NeRF) have +demonstrated high-fidelity image synthesis but struggle with sparse input +views. Existing methods, such as Generalizable NeRFs and per-scene optimization +approaches, face limitations in data dependency, computational cost, and +generalization across diverse scenarios. To overcome these challenges, we +propose ZeroRF, whose key idea is to integrate a tailored Deep Image Prior into +a factorized NeRF representation. Unlike traditional methods, ZeroRF +parametrizes feature grids with a neural network generator, enabling efficient +sparse view 360{\deg} reconstruction without any pretraining or additional +regularization. Extensive experiments showcase ZeroRF's versatility and +superiority in terms of both quality and speed, achieving state-of-the-art +results on benchmark datasets. ZeroRF's significance extends to applications in +3D content generation and editing. Project page: +https://sarahweiii.github.io/zerorf/",cs.CV,"['cs.CV', 'cs.GR']" +HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data,Qifan Yu · Juncheng Li · Longhui Wei · Liang Pang · Wentao Ye · Bosheng Qin · Siliang Tang · Qi Tian · Yueting Zhuang, ,https://arxiv.org/abs/2311.13614,,2311.13614.pdf,HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data,"Multi-modal Large Language Models (MLLMs) tuned on machine-generated +instruction-following data have demonstrated remarkable performance in various +multi-modal understanding and generation tasks. However, the hallucinations +inherent in machine-generated data, which could lead to hallucinatory outputs +in MLLMs, remain under-explored. This work aims to investigate various +hallucinations (i.e., object, relation, attribute hallucinations) and mitigate +those hallucinatory toxicities in large-scale machine-generated visual +instruction datasets. Drawing on the human ability to identify factual errors, +we present a novel hallucination detection and elimination framework, +HalluciDoctor, based on the cross-checking paradigm. We use our framework to +identify and eliminate hallucinations in the training data automatically. +Interestingly, HalluciDoctor also indicates that spurious correlations arising +from long-tail object co-occurrences contribute to hallucinations. Based on +that, we execute counterfactual visual instruction expansion to balance data +distribution, thereby enhancing MLLMs' resistance to hallucinations. +Comprehensive experiments on hallucination evaluation benchmarks show that our +method successfully mitigates 44.6% hallucinations relatively and maintains +competitive performance compared to LLaVA. The data and code for this paper are +publicly available. 
\url{https://github.com/Yuqifan1117/HalluciDoctor}.",cs.CV,"['cs.CV', 'cs.AI']" +Mudslide: A Universal Nuclear Instance Segmentation Method,Jun Wang, ,https://arxiv.org/abs/2311.15939,,2311.15939.pdf,Unleashing the Power of Prompt-driven Nucleus Instance Segmentation,"Nucleus instance segmentation in histology images is crucial for a broad +spectrum of clinical applications. Current dominant algorithms rely on +regression of nuclear proxy maps. Distinguishing nucleus instances from the +estimated maps requires carefully curated post-processing, which is error-prone +and parameter-sensitive. Recently, the Segment Anything Model (SAM) has earned +huge attention in medical image segmentation, owing to its impressive +generalization ability and promptable property. Nevertheless, its potential on +nucleus instance segmentation remains largely underexplored. In this paper, we +present a novel prompt-driven framework that consists of a nucleus prompter and +SAM for automatic nucleus instance segmentation. Specifically, the prompter +learns to generate a unique point prompt for each nucleus while the SAM is +fine-tuned to output the corresponding mask for the prompted nucleus. +Furthermore, we propose the inclusion of adjacent nuclei as negative prompts to +enhance the model's capability to identify overlapping nuclei. Without +complicated post-processing, our proposed method sets a new state-of-the-art +performance on three challenging benchmarks. Code is available at +\url{github.com/windygoo/PromptNucSeg}",cs.CV,['cs.CV'] +MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models,Yanting Wang · Hongye Fu · Wei Zou · Jinyuan Jia, ,https://arxiv.org/abs/2403.19080,,2403.19080.pdf,MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models,"Different from a unimodal model whose input is from a single modality, the +input (called multi-modal input) of a multi-modal model is from multiple +modalities such as image, 3D points, audio, text, etc. Similar to unimodal +models, many existing studies show that a multi-modal model is also vulnerable +to adversarial perturbation, where an attacker could add small perturbation to +all modalities of a multi-modal input such that the multi-modal model makes +incorrect predictions for it. Existing certified defenses are mostly designed +for unimodal models, which achieve sub-optimal certified robustness guarantees +when extended to multi-modal models as shown in our experimental results. In +our work, we propose MMCert, the first certified defense against adversarial +attacks to a multi-modal model. We derive a lower bound on the performance of +our MMCert under arbitrary adversarial attacks with bounded perturbations to +both modalities (e.g., in the context of auto-driving, we bound the number of +changed pixels in both RGB image and depth image). We evaluate our MMCert using +two benchmark datasets: one for the multi-modal road segmentation task and the +other for the multi-modal emotion recognition task. Moreover, we compare our +MMCert with a state-of-the-art certified defense extended from unimodal models. 
+Our experimental results show that our MMCert outperforms the baseline.",cs.CV,"['cs.CV', 'cs.CR']" +NTO3D: Neural Target Object 3D Reconstruction with Segment Anything,Xiaobao Wei · Renrui Zhang · Jiarui Wu · Jiaming Liu · Ming Lu · Yandong Guo · Shanghang Zhang, ,https://arxiv.org/abs/2309.12790,,2309.12790.pdf,NTO3D: Neural Target Object 3D Reconstruction with Segment Anything,"Neural 3D reconstruction from multi-view images has recently attracted +increasing attention from the community. Existing methods normally learn a +neural field for the whole scene, while it is still under-explored how to +reconstruct a target object indicated by users. Considering the Segment +Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in +this paper, we propose NTO3D, a novel high-quality Neural Target Object 3D +(NTO3D) reconstruction method, which leverages the benefits of both neural +field and SAM. We first propose a novel strategy to lift the multi-view 2D +segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy +field is then projected into 2D space and generates the new prompts for SAM. +This process is iterative until convergence to separate the target object from +the scene. After this, we then lift the 2D features of the SAM encoder into a +3D feature field in order to improve the reconstruction quality of the target +object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field +for high-quality neural target object 3D reconstruction. We conduct detailed +experiments on several benchmark datasets to demonstrate the advantages of our +method. The code will be available at: https://github.com/ucwxb/NTO3D.",cs.CV,['cs.CV'] +A Bayesian Approach to OOD Robustness in Image Classification,Prakhar Kaushik · Adam Kortylewski · Alan L. Yuille, ,https://arxiv.org/abs/2403.07277v1,,2403.07277v1.pdf,A Bayesian Approach to OOD Robustness in Image Classification,"An important and unsolved problem in computer vision is to ensure that the +algorithms are robust to changes in image domains. We address this problem in +the scenario where we have access to images from the target domains but no +annotations. Motivated by the challenges of the OOD-CV benchmark where we +encounter real world Out-of-Domain (OOD) nuisances and occlusion, we introduce +a novel Bayesian approach to OOD robustness for object classification. Our work +extends Compositional Neural Networks (CompNets), which have been shown to be +robust to occlusion but degrade badly when tested on OOD data. We exploit the +fact that CompNets contain a generative head defined over feature vectors +represented by von Mises-Fisher (vMF) kernels, which correspond roughly to +object parts, and can be learned without supervision. We obverse that some vMF +kernels are similar between different domains, while others are not. This +enables us to learn a transitional dictionary of vMF kernels that are +intermediate between the source and target domains and train the generative +model on this dictionary using the annotations on the source domain, followed +by iterative refinement. This approach, termed Unsupervised Generative +Transition (UGT), performs very well in OOD scenarios even when occlusion is +present. UGT is evaluated on different OOD benchmarks including the OOD-CV +dataset, several popular datasets (e.g., ImageNet-C [9]), artificial image +corruptions (including adding occluders), and synthetic-to-real domain +transfer, and does well in all scenarios outperforming SOTA alternatives (e.g. 
+up to 10% top-1 accuracy on Occluded OOD-CV dataset).",cs.CV,"['cs.CV', 'cs.AI']" +SNI-SLAM: Semantic Neural Implicit SLAM,Siting Zhu · Guangming Wang · Hermann Blum · Jiuming Liu · LiangSong · Marc Pollefeys · Hesheng Wang, ,https://arxiv.org/abs/2311.11016,,2311.11016.pdf,SNI-SLAM: Semantic Neural Implicit SLAM,"We propose SNI-SLAM, a semantic SLAM system utilizing neural implicit +representation, that simultaneously performs accurate semantic mapping, +high-quality surface reconstruction, and robust camera tracking. In this +system, we introduce hierarchical semantic representation to allow multi-level +semantic comprehension for top-down structured semantic mapping of the scene. +In addition, to fully utilize the correlation between multiple attributes of +the environment, we integrate appearance, geometry and semantic features +through cross-attention for feature collaboration. This strategy enables a more +multifaceted understanding of the environment, thereby allowing SNI-SLAM to +remain robust even when single attribute is defective. Then, we design an +internal fusion-based decoder to obtain semantic, RGB, Truncated Signed +Distance Field (TSDF) values from multi-level features for accurate decoding. +Furthermore, we propose a feature loss to update the scene representation at +the feature level. Compared with low-level losses such as RGB loss and depth +loss, our feature loss is capable of guiding the network optimization on a +higher-level. Our SNI-SLAM method demonstrates superior performance over all +recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on +Replica and ScanNet datasets, while also showing excellent capabilities in +accurate semantic segmentation and real-time semantic mapping.",cs.RO,['cs.RO'] +PBWR: Parametric Building Wireframe Reconstruction from Aerial LiDAR Point Clouds,Shangfeng Huang · Ruisheng Wang · Bo Guo · Hongxin Yang, ,https://arxiv.org/abs/2311.12062,,2311.12062.pdf,PBWR: Parametric Building Wireframe Reconstruction from Aerial LiDAR Point Clouds,"In this paper, we present an end-to-end 3D building wireframe reconstruction +method to regress edges directly from aerial LiDAR point clouds.Our method, +named Parametric Building Wireframe Reconstruction (PBWR), takes aerial LiDAR +point clouds and initial edge entities as input, and fully uses self-attention +mechanism of transformers to regress edge parameters without any intermediate +steps such as corner prediction. We propose an edge non-maximum suppression +(E-NMS) module based on edge similarityto remove redundant edges. Additionally, +a dedicated edge loss function is utilized to guide the PBWR in regressing +edges parameters, where simple use of edge distance loss isn't suitable. In our +experiments, we demonstrate state-of-the-art results on the Building3D dataset, +achieving an improvement of approximately 36% in entry-level dataset edge +accuracy and around 42% improvement in the Tallinn dataset.",cs.CV,"['cs.CV', 'cs.AI']" +Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling,Zhe Li · Zerong Zheng · Lizhen Wang · Yebin Liu,https://animatable-gaussians.github.io/,https://arxiv.org/abs/2311.16096,,2311.16096.pdf,Animatable and Relightable Gaussians for High-fidelity Human Avatar Modeling,"Modeling animatable human avatars from RGB videos is a long-standing and +challenging problem. 
Recent works usually adopt MLP-based neural radiance +fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to +regress pose-dependent garment details. To this end, we introduce Animatable +Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D +Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians +with the animatable avatar, we learn a parametric template from the input +videos, and then parameterize the template on two front & back canonical +Gaussian maps where each pixel represents a 3D Gaussian. The learned template +is adaptive to the wearing garments for modeling looser clothes like dresses. +Such template-guided 2D parameterization enables us to employ a powerful +StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling +detailed dynamic appearances. Furthermore, we introduce a pose projection +strategy for better generalization given novel poses. To tackle the realistic +relighting of animatable avatars, we introduce physically-based rendering into +the avatar representation for decomposing avatar materials and environment +illumination. Overall, our method can create lifelike avatars with dynamic, +realistic, generalized and relightable appearances. Experiments show that our +method outperforms other state-of-the-art approaches.",cs.CV,"['cs.CV', 'cs.GR']" +Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal,Yijun Yang · Hongtao Wu · Angelica I. Aviles-Rivero · Yulun Zhang · Jing Qin · Lei Zhu, ,https://arxiv.org/abs/2403.07684,,2403.07684.pdf,Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal,"Real-world vision tasks frequently suffer from the appearance of unexpected +adverse weather conditions, including rain, haze, snow, and raindrops. In the +last decade, convolutional neural networks and vision transformers have yielded +outstanding results in single-weather video removal. However, due to the +absence of appropriate adaptation, most of them fail to generalize to other +weather conditions. Although ViWS-Net is proposed to remove adverse weather +conditions in videos with a single set of pre-trained weights, it is seriously +blinded by seen weather at train-time and degenerates when coming to unseen +weather during test-time. In this work, we introduce test-time adaptation into +adverse weather removal in videos, and propose the first framework that +integrates test-time adaptation into the iterative diffusion reverse process. +Specifically, we devise a diffusion-based network with a novel temporal noise +model to efficiently explore frame-correlated information in degraded video +clips at training stage. During inference stage, we introduce a proxy task +named Diffusion Tubelet Self-Calibration to learn the primer distribution of +test video stream and optimize the model by approximating the temporal noise +model for online adaptation. Experimental results, on benchmark datasets, +demonstrate that our Test-Time Adaptation method with Diffusion-based +network(Diff-TTA) outperforms state-of-the-art methods in terms of restoring +videos degraded by seen weather conditions. 
Its generalizable capability is +also validated with unseen weather conditions in both synthesized and +real-world videos.",cs.CV,['cs.CV'] +Generalizable Novel-View Synthesis using a Stereo Camera,Haechan Lee · Wonjoon Jin · Seung-Hwan Baek · Sunghyun Cho,https://jinwonjoon.github.io/stereonerf/,https://arxiv.org/abs/2404.13541,,2404.13541.pdf,Generalizable Novel-View Synthesis using a Stereo Camera,"In this paper, we propose the first generalizable view synthesis approach +that specifically targets multi-view stereo-camera images. Since recent stereo +matching has demonstrated accurate geometry prediction, we introduce stereo +matching into novel-view synthesis for high-quality geometry reconstruction. To +this end, this paper proposes a novel framework, dubbed StereoNeRF, which +integrates stereo matching into a NeRF-based generalizable view synthesis +approach. StereoNeRF is equipped with three key components to effectively +exploit stereo matching in novel-view synthesis: a stereo feature extractor, a +depth-guided plane-sweeping, and a stereo depth loss. Moreover, we propose the +StereoNVS dataset, the first multi-view dataset of stereo-camera images, +encompassing a wide variety of both real and synthetic scenes. Our experimental +results demonstrate that StereoNeRF surpasses previous approaches in +generalizable view synthesis.",cs.CV,['cs.CV'] +PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos,Yufei Zhang · Jeffrey Kephart · Zijun Cui · Qiang Ji, ,https://arxiv.org/abs/2404.04430,,2404.04430.pdf,PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos,"While current methods have shown promising progress on estimating 3D human +motion from monocular videos, their motion estimates are often physically +unrealistic because they mainly consider kinematics. In this paper, we +introduce Physics-aware Pretrained Transformer (PhysPT), which improves +kinematics-based motion estimates and infers motion forces. PhysPT exploits a +Transformer encoder-decoder backbone to effectively learn human dynamics in a +self-supervised manner. Moreover, it incorporates physics principles governing +human motion. Specifically, we build a physics-based body representation and +contact force model. We leverage them to impose novel physics-inspired training +losses (i.e., force loss, contact loss, and Euler-Lagrange loss), enabling +PhysPT to capture physical properties of the human body and the forces it +experiences. Experiments demonstrate that, once trained, PhysPT can be directly +applied to kinematics-based estimates to significantly enhance their physical +plausibility and generate favourable motion forces. Furthermore, we show that +these physically meaningful quantities translate into improved accuracy of an +important downstream task: human action recognition.",cs.CV,['cs.CV'] +Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors,Nicolae Ristea · Florinel Croitoru · Radu Tudor Ionescu · Marius Popescu · Fahad Shahbaz Khan · Mubarak Shah,https://github.com/ristea/aed-mae/tree/main,https://arxiv.org/abs/2306.12041v2,,2306.12041v2.pdf,Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors,"We propose an efficient abnormal event detection model based on a lightweight +masked auto-encoder (AE) applied at the video frame level. The novelty of the +proposed model is threefold. 
First, we introduce an approach to weight tokens +based on motion gradients, thus shifting the focus from the static background +scene to the foreground objects. Second, we integrate a teacher decoder and a +student decoder into our architecture, leveraging the discrepancy between the +outputs given by the two decoders to improve anomaly detection. Third, we +generate synthetic abnormal events to augment the training videos, and task the +masked AE model to jointly reconstruct the original frames (without anomalies) +and the corresponding pixel-level anomaly maps. Our design leads to an +efficient and effective model, as demonstrated by the extensive experiments +carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. +The empirical results show that our model achieves an excellent trade-off +between speed and accuracy, obtaining competitive AUC scores, while processing +1655 FPS. Hence, our model is between 8 and 70 times faster than competing +methods. We also conduct an ablation study to justify our design. Our code is +freely available at: https://github.com/ristea/aed-mae.",cs.CV,"['cs.CV', 'cs.LG']" +Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization,Deng Li · Aming Wu · Yaowei Wang · Yahong Han, ,https://arxiv.org/abs/2402.18447,,2402.18447.pdf,Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization,"Single-domain generalization aims to learn a model from single source domain +data to achieve generalized performance on other unseen target domains. +Existing works primarily focus on improving the generalization ability of +static networks. However, static networks are unable to dynamically adapt to +the diverse variations in different image scenes, leading to limited +generalization capability. Different scenes exhibit varying levels of +complexity, and the complexity of images further varies significantly in +cross-domain scenarios. In this paper, we propose a dynamic object-centric +perception network based on prompt learning, aiming to adapt to the variations +in image complexity. Specifically, we propose an object-centric gating module +based on prompt learning to focus attention on the object-centric features +guided by the various scene prompts. Then, with the object-centric gating +masks, the dynamic selective module dynamically selects highly correlated +feature regions in both spatial and channel dimensions enabling the model to +adaptively perceive object-centric relevant features, thereby enhancing the +generalization capability. Extensive experiments were conducted on +single-domain generalization tasks in image classification and object +detection. The experimental results demonstrate that our approach outperforms +state-of-the-art methods, which validates the effectiveness and generally of +our proposed method.",cs.CV,['cs.CV'] +PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,Yutong Xie · Qi Chen · Sinuo Wang · Minh-Son To · Iris Lee · Ee Win Khoo · Kerolos Hendy · Daniel Koh · Yong Xia · Qi Wu, ,https://arxiv.org/abs/2404.04960,,2404.04960.pdf,PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,"Current vision-language pre-training (VLP) methodologies predominantly depend +on paired image-text datasets, a resource that is challenging to acquire in +radiology due to privacy considerations and labelling complexities. 
Data +augmentation provides a practical solution to overcome the issue of data +scarcity, however, most augmentation methods exhibit a limited focus, +prioritising either image or text augmentation exclusively. Acknowledging this +limitation, our objective is to devise a framework capable of concurrently +augmenting medical image and text data. We design a Pairwise Augmentation +(PairAug) approach that contains an Inter-patient Augmentation (InterAug) +branch and an Intra-patient Augmentation (IntraAug) branch. Specifically, the +InterAug branch of our approach generates radiology images using synthesised +yet plausible reports derived from a Large Language Model (LLM). The generated +pairs can be considered a collection of new patient cases since they are +artificially created and may not exist in the original dataset. In contrast, +the IntraAug branch uses newly generated reports to manipulate images. This +process allows us to create new paired data for each individual with diverse +medical conditions. Our extensive experiments on various downstream tasks +covering medical image classification zero-shot and fine-tuning analysis +demonstrate that our PairAug, concurrently expanding both image and text data, +substantially outperforms image-/text-only expansion baselines and advanced +medical VLP baselines. Our code is released at +\url{https://github.com/YtongXie/PairAug}.",cs.CV,['cs.CV'] +CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning,Lianggangxu Chen · Xuejiao Wang · Jiale Lu · Shaohui Lin · Changbo Wang · Gaoqi He, ,https://arxiv.org/abs/2309.16650,,2309.16650.pdf,ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning,"For robots to perform a wide variety of tasks, they require a 3D +representation of the world that is semantically rich, yet compact and +efficient for task-driven perception and planning. Recent approaches have +attempted to leverage features from large vision-language models to encode +semantics in 3D representations. However, these approaches tend to produce maps +with per-point feature vectors, which do not scale well in larger environments, +nor do they contain semantic spatial relationships between entities in the +environment, which are useful for downstream planning. In this work, we propose +ConceptGraphs, an open-vocabulary graph-structured representation for 3D +scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing +their output to 3D by multi-view association. The resulting representations +generalize to novel semantic classes, without the need to collect large 3D +datasets or finetune models. We demonstrate the utility of this representation +through a number of downstream planning tasks that are specified through +abstract (language) prompts and require complex reasoning over spatial and +semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer +video: https://youtu.be/mRhNkQwRYnc )",cs.RO,"['cs.RO', 'cs.CV']" +Initialization Matters for Adversarial Transfer Learning,Andong Hua · Jindong Gu · Zhiyu Xue · Nicholas Carlini · Eric Wong · Yao Qin, ,https://arxiv.org/abs/2312.05716,,2312.05716.pdf,Initialization Matters for Adversarial Transfer Learning,"With the prevalence of the Pretraining-Finetuning paradigm in transfer +learning, the robustness of downstream tasks has become a critical concern. 
In +this work, we delve into adversarial robustness in transfer learning and reveal +the critical role of initialization, including both the pretrained model and +the linear head. First, we discover the necessity of an adversarially robust +pretrained model. Specifically, we reveal that with a standard pretrained +model, Parameter-Efficient Finetuning (PEFT) methods either fail to be +adversarially robust or continue to exhibit significantly degraded adversarial +robustness on downstream tasks, even with adversarial training during +finetuning. Leveraging a robust pretrained model, surprisingly, we observe that +a simple linear probing can outperform full finetuning and other PEFT methods +with random initialization on certain datasets. We further identify that linear +probing excels in preserving robustness from the robust pretraining. Based on +this, we propose Robust Linear Initialization (RoLI) for adversarial +finetuning, which initializes the linear head with the weights obtained by +adversarial linear probing to maximally inherit the robustness from +pretraining. Across five different image classification datasets, we +demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. +Our code is available at \url{https://github.com/DongXzz/RoLI}.",cs.CV,['cs.CV'] +PEGASUS: Personalized Generative 3D Avatars with Composable Attributes,Hyunsoo Cha · Byungjun Kim · Hanbyul Joo, ,https://arxiv.org/abs/2402.10636,,2402.10636.pdf,PEGASUS: Personalized Generative 3D Avatars with Composable Attributes,"We present PEGASUS, a method for constructing a personalized generative 3D +face avatar from monocular video sources. Our generative 3D avatar enables +disentangled controls to selectively alter the facial attributes (e.g., hair or +nose) while preserving the identity. Our approach consists of two stages: +synthetic database generation and constructing a personalized generative +avatar. We generate a synthetic video collection of the target identity with +varying facial attributes, where the videos are synthesized by borrowing the +attributes from monocular videos of diverse identities. Then, we build a +person-specific generative 3D avatar that can modify its attributes +continuously while preserving its identity. Through extensive experiments, we +demonstrate that our method of generating a synthetic database and creating a +3D generative avatar is the most effective in preserving identity while +achieving high realism. Subsequently, we introduce a zero-shot approach to +achieve the same goal of generative modeling more efficiently by leveraging a +previously constructed personalized generative model.",cs.CV,['cs.CV'] +FedHCA$^2$: Towards Hetero-Client Federated Multi-Task Learning,Yuxiang Lu · Suizhi Huang · Yuwen Yang · Shalayiding Sirejiding · Yue Ding · Hongtao Lu,https://github.com/innovator-zero/FedHCA2,https://arxiv.org/abs/2311.13250v2,,2311.13250v2.pdf,FedHCA$^2$: Towards Hetero-Client Federated Multi-Task Learning,"Federated Learning (FL) enables joint training across distributed clients +using their local data privately. Federated Multi-Task Learning (FMTL) builds +on FL to handle multiple tasks, assuming model congruity that identical model +architecture is deployed in each client. To relax this assumption and thus +extend real-world applicability, we introduce a novel problem setting, +Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse +task setups. 
The main challenge of HC-FMTL is the model incongruity issue that +invalidates conventional aggregation methods. It also escalates the +difficulties in accurate model aggregation to deal with data and task +heterogeneity inherent in FMTL. To address these challenges, we propose the +FedHCA$^2$ framework, which allows for federated training of personalized +models by modeling relationships among heterogeneous clients. Drawing on our +theoretical insights into the difference between multi-task and federated +optimization, we propose the Hyper Conflict-Averse Aggregation scheme to +mitigate conflicts during encoder updates. Additionally, inspired by task +interaction in MTL, the Hyper Cross Attention Aggregation scheme uses +layer-wise cross attention to enhance decoder interactions while alleviating +model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for +each client to customize personalized parameter updates. Extensive experiments +demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios +compared to representative methods. Our code will be made publicly available.",cs.CV,"['cs.CV', 'cs.LG']" +Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation,Jonas Herzog, ,https://arxiv.org/abs/2402.17614,,2402.17614.pdf,Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation,"Few-shot segmentation performance declines substantially when facing images +from a domain different than the training domain, effectively limiting +real-world use cases. To alleviate this, recently cross-domain few-shot +segmentation (CD-FSS) has emerged. Works that address this task mainly +attempted to learn segmentation on a source domain in a manner that generalizes +across domains. Surprisingly, we can outperform these approaches while +eliminating the training stage and removing their main segmentation network. We +show test-time task-adaption is the key for successful CD-FSS instead. +Task-adaption is achieved by appending small networks to the feature pyramid of +a conventionally classification-pretrained backbone. To avoid overfitting to +the few labeled samples in supervised fine-tuning, consistency across augmented +views of input images serves as guidance while learning the parameters of the +attached layers. Despite our self-restriction not to use any images other than +the few labeled samples at test time, we achieve new state-of-the-art +performance in CD-FSS, evidencing the need to rethink approaches for the task.",cs.CV,['cs.CV'] +TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models,Haomiao Ni · Bernhard Egger · Suhas Lohit · Anoop Cherian · Ye Wang · Toshiaki Koike-Akino · Sharon X. Huang · Tim Marks, ,https://arxiv.org/abs/2404.16306,,2404.16306.pdf,TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models,"Text-conditioned image-to-video generation (TI2V) aims to synthesize a +realistic video starting from a given image (e.g., a woman's photo) and a text +description (e.g., ""a woman is drinking water.""). Existing TI2V frameworks +often require costly training on video-text datasets and specific model designs +for text and image conditioning. In this paper, we propose TI2V-Zero, a +zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) +diffusion model to be conditioned on a provided image, enabling TI2V generation +without any optimization, fine-tuning, or introducing external modules. 
Our +approach leverages a pretrained T2V diffusion foundation model as the +generative prior. To guide video generation with the additional image input, we +propose a ""repeat-and-slide"" strategy that modulates the reverse denoising +process, allowing the frozen diffusion model to synthesize a video +frame-by-frame starting from the provided image. To ensure temporal continuity, +we employ a DDPM inversion strategy to initialize Gaussian noise for each newly +synthesized frame and a resampling technique to help preserve visual details. +We conduct comprehensive experiments on both domain-specific and open-domain +datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V +model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks +such as video infilling and prediction when provided with more images. Its +autoregressive design also supports long video generation.",cs.CV,['cs.CV'] +Atom-Level Optical Chemical Structure Recognition with Limited Supervision,Martijn Oldenhof · Edward De Brouwer · Adam Arany · Yves Moreau,https://github.com/molden/atomlenz,https://arxiv.org/abs/2404.01743,,2404.01743.pdf,Atom-Level Optical Chemical Structure Recognition with Limited Supervision,"Identifying the chemical structure from a graphical representation, or image, +of a molecule is a challenging pattern recognition task that would greatly +benefit drug development. Yet, existing methods for chemical structure +recognition do not typically generalize well, and show diminished effectiveness +when confronted with domains where data is sparse, or costly to generate, such +as hand-drawn molecule images. To address this limitation, we propose a new +chemical structure recognition tool that delivers state-of-the-art performance +and can adapt to new domains with a limited number of data samples and +supervision. Unlike previous approaches, our method provides atom-level +localization, and can therefore segment the image into the different atoms and +bonds. Our model is the first model to perform OCSR with atom-level entity +detection with only SMILES supervision. Through rigorous and extensive +benchmarking, we demonstrate the preeminence of our chemical structure +recognition approach in terms of data efficiency, accuracy, and atom-level +entity prediction.",cs.CV,['cs.CV'] +SubT-MRS Datasets: Pushing SLAM Towards All-weather Environments,Shibo Zhao · Yuanjun Gao · Tianhao Wu · Damanpreet Singh · Rushan Jiang · Haoxiang Sun · Mansi Sarawata · Warren Whittaker · Ian Higgins · Shaoshu Su · Yi Du · Can Xu · John Keller · Jay Karhade · Lucas Nogueira · Sourojit Saha · Yuheng Qiu · Ji Zhang · Wenshan Wang · Chen Wang · Sebastian Scherer,https://superodometry.com/datasets,https://arxiv.org/abs/2307.07607,,2307.07607.pdf,SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments,"Simultaneous localization and mapping (SLAM) is a fundamental task for +numerous applications such as autonomous navigation and exploration. Despite +many SLAM datasets have been released, current SLAM solutions still struggle to +have sustained and resilient performance. One major issue is the absence of +high-quality datasets including diverse all-weather conditions and a reliable +metric for assessing robustness. This limitation significantly restricts the +scalability and generalizability of SLAM technologies, impacting their +development, validation, and deployment. 
To address this problem, we present +SubT-MRS, an extremely challenging real-world dataset designed to push SLAM +towards all-weather environments to pursue the most robust SLAM performance. It +contains multi-degraded environments including over 30 diverse scenes such as +structureless corridors, varying lighting conditions, and perceptual obscurants +like smoke and dust; multimodal sensors such as LiDAR, fisheye camera, IMU, and +thermal camera; and multiple locomotions like aerial, legged, and wheeled +robots. We develop accuracy and robustness evaluation tracks for SLAM and +introduced novel robustness metrics. Comprehensive studies are performed, +revealing new observations, challenges, and opportunities for future research.",cs.RO,['cs.RO'] +Class Incremental Learning with Multi-Teacher Distillation,Haitao Wen · Lili Pan · Yu Dai · Heqian Qiu · Lanxiao Wang · Qingbo Wu · Hongliang Li, ,https://arxiv.org/abs/2306.17560,,2306.17560.pdf,Class-Incremental Learning using Diffusion Model for Distillation and Replay,"Class-incremental learning aims to learn new classes in an incremental +fashion without forgetting the previously learned ones. Several research works +have shown how additional data can be used by incremental models to help +mitigate catastrophic forgetting. In this work, following the recent +breakthrough in text-to-image generative models and their wide distribution, we +propose the use of a pretrained Stable Diffusion model as a source of +additional data for class-incremental learning. Compared to competitive methods +that rely on external, often unlabeled, datasets of real images, our approach +can generate synthetic samples belonging to the same classes as the previously +encountered images. This allows us to use those additional data samples not +only in the distillation loss but also for replay in the classification loss. +Experiments on the competitive benchmarks CIFAR100, ImageNet-Subset, and +ImageNet demonstrate how this new approach can be used to further improve the +performance of state-of-the-art methods for class-incremental learning on large +scale datasets.",cs.LG,"['cs.LG', 'cs.CV']" +MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization,Jimin Xu · Tianbao Wang · Tao Jin · Shengyu Zhang · Dongjie Fu · Zhe Wang · Jiangjing Lyu · Chengfei Lv · Chaoyue Niu · Zhou Yu · Zhou Zhao · Fei Wu,https://mpod-123.github.io/,https://arxiv.org/abs/2306.17843,,2306.17843.pdf,Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors,"We present Magic123, a two-stage coarse-to-fine approach for high-quality, +textured 3D meshes generation from a single unposed image in the wild using +both2D and 3D priors. In the first stage, we optimize a neural radiance field +to produce a coarse geometry. In the second stage, we adopt a memory-efficient +differentiable mesh representation to yield a high-resolution mesh with a +visually appealing texture. In both stages, the 3D content is learned through +reference view supervision and novel views guided by a combination of 2D and 3D +diffusion priors. We introduce a single trade-off parameter between the 2D and +3D priors to control exploration (more imaginative) and exploitation (more +precise) of the generated geometry. Additionally, we employ textual inversion +and monocular depth regularization to encourage consistent appearances across +views and to prevent degenerate solutions, respectively. 
Magic123 demonstrates +a significant improvement over previous image-to-3D techniques, as validated +through extensive experiments on synthetic benchmarks and diverse real-world +images. Our code, models, and generated 3D assets are available at +https://github.com/guochengqian/Magic123.",cs.CV,['cs.CV'] +RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation,Oded Bialer · Yuval Haitman,https://yuvalhg.github.io/RadSimReal/,https://arxiv.org/abs/2404.18150,,2404.18150.pdf,RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation,"Object detection in radar imagery with neural networks shows great potential +for improving autonomous driving. However, obtaining annotated datasets from +real radar images, crucial for training these networks, is challenging, +especially in scenarios with long-range detection and adverse weather and +lighting conditions where radar performance excels. To address this challenge, +we present RadSimReal, an innovative physical radar simulation capable of +generating synthetic radar images with accompanying annotations for various +radar types and environmental conditions, all without the need for real data +collection. Remarkably, our findings demonstrate that training object detection +models on RadSimReal data and subsequently evaluating them on real-world data +produce performance levels comparable to models trained and tested on real data +from the same dataset, and even achieves better performance when testing across +different real datasets. RadSimReal offers advantages over other physical radar +simulations that it does not necessitate knowledge of the radar design details, +which are often not disclosed by radar suppliers, and has faster run-time. This +innovative tool has the potential to advance the development of computer vision +algorithms for radar-based autonomous driving applications.",cs.CV,['cs.CV'] +AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error,Jonas Ricker · Denis Lukovnikov · Asja Fischer, ,https://arxiv.org/abs/2401.17879,,2401.17879.pdf,AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error,"With recent text-to-image models, anyone can generate deceptively realistic +images with arbitrary contents, fueling the growing threat of visual +disinformation. A key enabler for generating high-resolution images with low +computational cost has been the development of latent diffusion models (LDMs). +In contrast to conventional diffusion models, LDMs perform the denoising +process in the low-dimensional latent space of a pre-trained autoencoder (AE) +instead of the high-dimensional image space. Despite their relevance, the +forensic analysis of LDMs is still in its infancy. In this work we propose +AEROBLADE, a novel detection method which exploits an inherent component of +LDMs: the AE used to transform images between image and latent space. We find +that generated images can be more accurately reconstructed by the AE than real +images, allowing for a simple detection approach based on the reconstruction +error. Most importantly, our method is easy to implement and does not require +any training, yet nearly matches the performance of detectors that rely on +extensive training. We empirically demonstrate that AEROBLADE is effective +against state-of-the-art LDMs, including Stable Diffusion and Midjourney. 
+Beyond detection, our approach allows for the qualitative analysis of images, +which can be leveraged for identifying inpainted regions. We release our code +and data at https://github.com/jonasricker/aeroblade .",cs.CV,['cs.CV'] +"Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance",Zan Wang · Yixin Chen · Baoxiong Jia · Puhao Li · Jinlu Zhang · Jingze Zhang · Tengyu Liu · Yixin Zhu · Wei Liang · Siyuan Huang,https://afford-motion.github.io/,https://arxiv.org/abs/2403.18036,,2403.18036.pdf,"Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance","Despite significant advancements in text-to-motion synthesis, generating +language-guided human motion within 3D environments poses substantial +challenges. These challenges stem primarily from (i) the absence of powerful +generative models capable of jointly modeling natural language, 3D scenes, and +human motion, and (ii) the generative models' intensive data requirements +contrasted with the scarcity of comprehensive, high-quality, +language-scene-motion datasets. To tackle these issues, we introduce a novel +two-stage framework that employs scene affordance as an intermediate +representation, effectively linking 3D scene grounding and conditional motion +generation. Our framework comprises an Affordance Diffusion Model (ADM) for +predicting explicit affordance map and an Affordance-to-Motion Diffusion Model +(AMDM) for generating plausible human motions. By leveraging scene affordance +maps, our method overcomes the difficulty in generating human motion under +multimodal condition signals, especially when training with limited data +lacking extensive language-scene-motion pairs. Our extensive experiments +demonstrate that our approach consistently outperforms all baselines on +established benchmarks, including HumanML3D and HUMANISE. Additionally, we +validate our model's exceptional generalization capabilities on a specially +curated evaluation set featuring previously unseen descriptions and scenes.",cs.CV,['cs.CV'] +SignGraph: A Sign Sequence is Worth Graphs of Nodes,Shiwei Gan · Yafeng Yin · Zhiwei Jiang · Hongkai Wen · Lei Xie · Sanglu Lu,https://github.com/gswycf/SignGraph,,https://www.semanticscholar.org/paper/Towards-Real-Time-Sign-Language-Recognition-and-on-Gan-Yin/dba462bcf68db62a4722c7f220f38461ff981f15,,,,,nan +Animating General Image with Large Visual Motion Model,Dengsheng Chen · Xiaoming Wei · Xiaolin Wei, ,https://arxiv.org/abs/2311.12886,,2311.12886.pdf,AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance,"Image animation is a key task in computer vision which aims to generate +dynamic visual content from static image. Recent image animation methods employ +neural based rendering technique to generate realistic animations. Despite +these advancements, achieving fine-grained and controllable image animation +guided by text remains challenging, particularly for open-domain images +captured in diverse real environments. In this paper, we introduce an open +domain image animation method that leverages the motion prior of video +diffusion model. Our approach introduces targeted motion area guidance and +motion strength guidance, enabling precise control the movable area and its +motion speed. This results in enhanced alignment between the animated visual +elements and the prompting text, thereby facilitating a fine-grained and +interactive animation generation process for intricate motion sequences. 
We +validate the effectiveness of our method through rigorous experiments on an +open-domain dataset, with the results showcasing its superior performance. +Project page can be found at https://animationai.github.io/AnimateAnything.",cs.CV,['cs.CV'] +DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video,Huiqiang Sun · Xingyi Li · Liao Shen · Xinyi Ye · Ke Xian · Zhiguo Cao, ,https://arxiv.org/abs/2403.10103,,2403.10103.pdf,DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video,"Recent advancements in dynamic neural radiance field methods have yielded +remarkable outcomes. However, these approaches rely on the assumption of sharp +input images. When faced with motion blur, existing dynamic NeRF methods often +struggle to generate high-quality novel views. In this paper, we propose +DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views +from a monocular video affected by motion blur. To account for motion blur in +input images, we simultaneously capture the camera trajectory and object +Discrete Cosine Transform (DCT) trajectories within the scene. Additionally, we +employ a global cross-time rendering approach to ensure consistent temporal +coherence across the entire scene. We curate a dataset comprising diverse +dynamic scenes that are specifically tailored for our task. Experimental +results on our dataset demonstrate that our method outperforms existing +approaches in generating sharp novel views from motion-blurred inputs while +maintaining spatial-temporal consistency of the scene.",cs.CV,['cs.CV'] +Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification,Tingting Zheng · Kui Jiang · Hongxun Yao,https://vilab.hit.edu.cn/projects/pamil,https://arxiv.org/abs/2403.07939,,2403.07939.pdf,Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification,"Multi-Instance Learning (MIL) has shown impressive performance for +histopathology whole slide image (WSI) analysis using bags or pseudo-bags. It +involves instance sampling, feature representation, and decision-making. +However, existing MIL-based technologies at least suffer from one or more of +the following problems: 1) requiring high storage and intensive pre-processing +for numerous instances (sampling); 2) potential over-fitting with limited +knowledge to predict bag labels (feature representation); 3) pseudo-bag counts +and prior biases affect model robustness and generalizability +(decision-making). Inspired by clinical diagnostics, using the past sampling +instances can facilitate the final WSI analysis, but it is barely explored in +prior technologies. To break free these limitations, we integrate the dynamic +instance sampling and reinforcement learning into a unified framework to +improve the instance selection and feature aggregation, forming a novel Dynamic +Policy Instance Selection (DPIS) scheme for better and more credible +decision-making. Specifically, the measurement of feature distance and reward +function are employed to boost continuous instance sampling. To alleviate the +over-fitting, we explore the latent global relations among instances for more +robust and discriminative feature representation while establishing reward and +punishment mechanisms to correct biases in pseudo-bags using contrastive +learning. These strategies form the final Dynamic Policy-Driven Adaptive +Multi-Instance Learning (PAMIL) method for WSI tasks. 
Extensive experiments +reveal that our PAMIL method outperforms the state-of-the-art by 3.8\% on +CAMELYON16 and 4.4\% on TCGA lung cancer datasets.",cs.CV,['cs.CV'] +OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos,Dongyoung Choi · Hyeonjoong Jang · Min H. Kim,https://vclab.kaist.ac.kr/cvpr2024p1,https://arxiv.org/abs/2404.00676,,2404.00676.pdf,OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos,"Omnidirectional cameras are extensively used in various applications to +provide a wide field of vision. However, they face a challenge in synthesizing +novel views due to the inevitable presence of dynamic objects, including the +photographer, in their wide field of view. In this paper, we introduce a new +approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can +render static-only scene views, removing and inpainting dynamic objects +simultaneously. Our approach combines the principles of local radiance fields +with the bidirectional optimization of omnidirectional rays. Our input is an +omnidirectional video, and we evaluate the mutual observations of the entire +angle between the previous and current frames. To reduce ghosting artifacts of +dynamic objects and inpaint occlusions, we devise a multi-resolution motion +mask prediction module. Unlike existing methods that primarily separate dynamic +components through the temporal domain, our method uses multi-resolution neural +feature planes for precise segmentation, which is more suitable for long +360-degree videos. Our experiments validate that OmniLocalRF outperforms +existing methods in both qualitative and quantitative metrics, especially in +scenarios with complex real-world scenes. In particular, our approach +eliminates the need for manual interaction, such as drawing motion masks by +hand and additional pose estimation, making it a highly effective and efficient +solution.",cs.CV,"['cs.CV', 'cs.GR']" +VBench: Comprehensive Benchmark Suite for Video Generative Models,Ziqi Huang · Yinan He · Jiashuo Yu · Fan Zhang · Chenyang Si · Yuming Jiang · Yuanhan Zhang · Tianxing Wu · Jin Qingyang · Nattapol Chanpaisit · Yaohui Wang · Xinyuan Chen · Limin Wang · Dahua Lin · Yu Qiao · Ziwei Liu,https://vchitect.github.io/VBench-project/,https://arxiv.org/abs/2311.17982,,2311.17982.pdf,VBench: Comprehensive Benchmark Suite for Video Generative Models,"Video generation has witnessed significant advancements, yet evaluating these +models remains a challenge. A comprehensive evaluation benchmark for video +generation is indispensable for two reasons: 1) Existing metrics do not fully +align with human perceptions; 2) An ideal evaluation system should provide +insights to inform future developments of video generation. To this end, we +present VBench, a comprehensive benchmark suite that dissects ""video generation +quality"" into specific, hierarchical, and disentangled dimensions, each with +tailored prompts and evaluation methods. VBench has three appealing properties: +1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation +(e.g., subject identity inconsistency, motion smoothness, temporal flickering, +and spatial relationship, etc). The evaluation metrics with fine-grained levels +reveal individual models' strengths and weaknesses. 2) Human Alignment: We also +provide a dataset of human preference annotations to validate our benchmarks' +alignment with human perception, for each evaluation dimension respectively. 
3) +Valuable Insights: We look into current models' ability across various +evaluation dimensions, and various content types. We also investigate the gaps +between video and image generation models. We will open-source VBench, +including all prompts, evaluation methods, generated videos, and human +preference annotations, and also include more video generation models in VBench +to drive forward the field of video generation.",cs.CV,['cs.CV'] +Privacy-preserving Optics for Enhancing Protection in Face De-identification,Jhon Lopez · Carlos Hinojosa · Henry Arguello · Bernard Ghanem,https://carloshinojosa.me/project/privacy-face-deid/,https://arxiv.org/abs/2404.00777,,2404.00777.pdf,Privacy-preserving Optics for Enhancing Protection in Face De-identification,"The modern surge in camera usage alongside widespread computer vision +technology applications poses significant privacy and security concerns. +Current artificial intelligence (AI) technologies aid in recognizing relevant +events and assisting in daily tasks in homes, offices, hospitals, etc. The need +to access or process personal information for these purposes raises privacy +concerns. While software-level solutions like face de-identification provide a +good privacy/utility trade-off, they present vulnerabilities to sniffing +attacks. In this paper, we propose a hardware-level face de-identification +method to solve this vulnerability. Specifically, our approach first learns an +optical encoder along with a regression model to obtain a face heatmap while +hiding the face identity from the source image. We also propose an +anonymization framework that generates a new face using the privacy-preserving +image, face heatmap, and a reference face image from a public dataset as input. +We validate our approach with extensive simulations and hardware experiments.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.LG', 'eess.IV']" +Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection,Yicheng Xiao · Zhuoyan Luo · Yong Liu · Yue Ma · Hengwei Bian · Yatai Ji · Yujiu Yang · Xiu Li, ,https://arxiv.org/abs/2311.16464,,2311.16464.pdf,Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection,"Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted +significant attention due to the growing demand for video analysis. Recent +approaches treat MR and HD as similar video grounding problems and address them +together with transformer-based architecture. However, we observe that the +emphasis of MR and HD differs, with one necessitating the perception of local +relationships and the other prioritizing the understanding of global contexts. +Consequently, the lack of task-specific design will inevitably lead to +limitations in associating the intrinsic specialty of two tasks. To tackle the +issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the +gap and jointly solve MR and HD effectively. By performing progressive +integration on intra and inter-modality across multi-granularity, UVCOM +achieves the comprehensive understanding in processing a video. Moreover, we +present multi-aspect contrastive learning to consolidate the local relation +modeling and global knowledge accumulation via well aligned multi-modal space. 
+Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights +and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which +outperforms the state-of-the-art methods by a remarkable margin.",cs.CV,"['cs.CV', 'cs.AI']" +Hyperbolic Learning with Synthetic Captions for Open-World Detection,Fanjie Kong · Yanbei Chen · Jiarui Cai · Davide Modolo, ,https://arxiv.org/abs/2404.05016,,2404.05016.pdf,Hyperbolic Learning with Synthetic Captions for Open-World Detection,"Open-world detection poses significant challenges, as it requires the +detection of any object using either object class labels or free-form texts. +Existing related works often use large-scale manual annotated caption datasets +for training, which are extremely expensive to collect. Instead, we propose to +transfer knowledge from vision-language models (VLMs) to enrich the +open-vocabulary descriptions automatically. Specifically, we bootstrap dense +synthetic captions using pre-trained VLMs to provide rich descriptions on +different regions in images, and incorporate these captions to train a novel +detector that generalizes to novel concepts. To mitigate the noise caused by +hallucination in synthetic captions, we also propose a novel hyperbolic +vision-language learning approach to impose a hierarchy between visual and +caption embeddings. We call our detector ``HyperLearner''. We conduct extensive +experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, +Object Detection in the Wild, RefCOCO) and our results show that our model +consistently outperforms existing state-of-the-art methods, such as GLIP, +GLIPv2 and Grounding DINO, when using the same backbone.",cs.CV,['cs.CV'] +Coherence As Texture -- Passive Textureless 3D Reconstruction by Self-interference,Wei-Yu Chen · Aswin C. Sankaranarayanan · Anat Levin · Matthew O’Toole, ,,https://onlinelibrary.wiley.com/doi/10.1002/lpor.202301155,,,,,nan +Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment,Angchi Xu · Wei-Shi Zheng, ,https://arxiv.org/abs/2403.19225,,2403.19225.pdf,Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment,"Weakly-supervised action segmentation is a task of learning to partition a +long video into several action segments, where training videos are only +accompanied by transcripts (ordered list of actions). Most of existing methods +need to infer pseudo segmentation for training by serial alignment between all +frames and the transcript, which is time-consuming and hard to be parallelized +while training. In this work, we aim to escape from this inefficient alignment +with massive but redundant frames, and instead to directly localize a few +action transitions for pseudo segmentation generation, where a transition +refers to the change from an action segment to its next adjacent one in the +transcript. As the true transitions are submerged in noisy boundaries due to +intra-segment visual variation, we propose a novel Action-Transition-Aware +Boundary Alignment (ATBA) framework to efficiently and effectively filter out +noisy boundaries and detect transitions. In addition, to boost the semantic +learning in the case that noise is inevitably present in the pseudo +segmentation, we also introduce video-level losses to utilize the trusted +video-level supervision. 
Extensive experiments show the effectiveness of our +approach on both performance and training speed.",cs.CV,['cs.CV'] +Physics-aware Hand-object Interaction Denoising,Haowen Luo · Yunze Liu · Li Yi, ,https://arxiv.org/abs/2405.11481,,2405.11481.pdf,Physics-aware Hand-object Interaction Denoising,"The credibility and practicality of a reconstructed hand-object interaction +sequence depend largely on its physical plausibility. However, due to high +occlusions during hand-object interaction, physical plausibility remains a +challenging criterion for purely vision-based tracking methods. To address this +issue and enhance the results of existing hand trackers, this paper proposes a +novel physically-aware hand motion de-noising method. Specifically, we +introduce two learned loss terms that explicitly capture two crucial aspects of +physical plausibility: grasp credibility and manipulation feasibility. These +terms are used to train a physically-aware de-noising network. Qualitative and +quantitative experiments demonstrate that our approach significantly improves +both fine-grained physical plausibility and overall pose accuracy, surpassing +current state-of-the-art de-noising methods.",cs.CV,['cs.CV'] +ToNNO: Tomographic Reconstruction of a Neural Network’s Output for Weakly Supervised Segmentation of 3D Medical Images,Marius Schmidt-Mengin · Alexis Benichoux · Shibeshih Belachew · Nikos Komodakis · Nikos Paragios, ,https://arxiv.org/abs/2404.13103,,2404.13103.pdf,ToNNO: Tomographic Reconstruction of a Neural Network's Output for Weakly Supervised Segmentation of 3D Medical Images,"Annotating lots of 3D medical images for training segmentation models is +time-consuming. The goal of weakly supervised semantic segmentation is to train +segmentation models without using any ground truth segmentation masks. Our work +addresses the case where only image-level categorical labels, indicating the +presence or absence of a particular region of interest (such as tumours or +lesions), are available. Most existing methods rely on class activation mapping +(CAM). We propose a novel approach, ToNNO, which is based on the Tomographic +reconstruction of a Neural Network's Output. Our technique extracts stacks of +slices with different angles from the input 3D volume, feeds these slices to a +2D encoder, and applies the inverse Radon transform in order to reconstruct a +3D heatmap of the encoder's predictions. This generic method allows to perform +dense prediction tasks on 3D volumes using any 2D image encoder. We apply it to +weakly supervised medical image segmentation by training the 2D encoder to +output high values for slices containing the regions of interest. We test it on +four large scale medical image datasets and outperform 2D CAM methods. We then +extend ToNNO by combining tomographic reconstruction with CAM methods, +proposing Averaged CAM and Tomographic CAM, which obtain even better results.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +An Aggregation-Free Federated Learning for Tackling Data Heterogeneity,Yuan Wang · Huazhu Fu · Renuga Kanagavelu · Qingsong Wei · Yong Liu · Rick Goh, ,https://arxiv.org/abs/2404.18962,,2404.18962.pdf,An Aggregation-Free Federated Learning for Tackling Data Heterogeneity,"The performance of Federated Learning (FL) hinges on the effectiveness of +utilizing knowledge from distributed datasets. 
Traditional FL methods adopt an +aggregate-then-adapt framework, where clients update local models based on a +global model aggregated by the server from the previous training round. This +process can cause client drift, especially with significant cross-client data +heterogeneity, impacting model performance and convergence of the FL algorithm. +To address these challenges, we introduce FedAF, a novel aggregation-free FL +algorithm. In this framework, clients collaboratively learn condensed data by +leveraging peer knowledge, the server subsequently trains the global model +using the condensed data and soft labels received from the clients. FedAF +inherently avoids the issue of client drift, enhances the quality of condensed +data amid notable data heterogeneity, and improves the global model +performance. Extensive numerical studies on several popular benchmark datasets +show FedAF surpasses various state-of-the-art FL algorithms in handling +label-skew and feature-skew data heterogeneity, leading to superior global +model accuracy and faster convergence.",cs.CV,"['cs.CV', 'cs.LG']" +HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative,CONG MA · Qiao Lei · Chengkai Zhu · Kai Liu · Zelong Kong · Liqing · Xueqi Zhou · Yuheng KAN · Wei Wu, ,https://arxiv.org/abs/2403.02640,,2403.02640.pdf,HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative,"Vehicle-to-everything (V2X) is a popular topic in the field of Autonomous +Driving in recent years. Vehicle-infrastructure cooperation (VIC) becomes one +of the important research area. Due to the complexity of traffic conditions +such as blind spots and occlusion, it greatly limits the perception +capabilities of single-view roadside sensing systems. To further enhance the +accuracy of roadside perception and provide better information to the vehicle +side, in this paper, we constructed holographic intersections with various +layouts to build a large-scale multi-sensor holographic vehicle-infrastructure +cooperation dataset, called HoloVIC. Our dataset includes 3 different types of +sensors (Camera, Lidar, Fisheye) and employs 4 sensor-layouts based on the +different intersections. Each intersection is equipped with 6-18 sensors to +capture synchronous data. While autonomous vehicles pass through these +intersections for collecting VIC data. HoloVIC contains in total on 100k+ +synchronous frames from different sensors. Additionally, we annotated 3D +bounding boxes based on Camera, Fisheye, and Lidar. We also associate the IDs +of the same objects across different devices and consecutive frames in +sequence. Based on HoloVIC, we formulated four tasks to facilitate the +development of related research. We also provide benchmarks for these tasks.",cs.CV,['cs.CV'] +OneFormer3D: One Transformer for Unified Point Cloud Segmentation,Maksim Kolodiazhnyi · Anna Vorontsova · Anton Konushin · Danila Rukhovich,https://github.com/oneformer3d/oneformer3d,https://arxiv.org/abs/2311.14405,,2311.14405.pdf,OneFormer3D: One Transformer for Unified Point Cloud Segmentation,"Semantic, instance, and panoptic segmentation of 3D point clouds have been +addressed using task-specific models of distinct design. Thereby, the +similarity of all segmentation tasks and the implicit relationship between them +have not been utilized effectively. This paper presents a unified, simple, and +effective model addressing all these tasks jointly. 
The model, named +OneFormer3D, performs instance and semantic segmentation consistently, using a +group of learnable kernels, where each kernel is responsible for generating a +mask for either an instance or a semantic category. These kernels are trained +with a transformer-based decoder with unified instance and semantic queries +passed as an input. Such a design enables training a model end-to-end in a +single run, so that it achieves top performance on all three segmentation tasks +simultaneously. Specifically, our OneFormer3D ranks 1st and sets a new +state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also +demonstrate the state-of-the-art results in semantic, instance, and panoptic +segmentation of ScanNet (+21 PQ), ScanNet200 (+3.8 mAP50), and S3DIS (+0.8 +mIoU) datasets.",cs.CV,['cs.CV'] +Federated Online Adaptation for Deep Stereo,Matteo Poggi · Fabio Tosi,https://fedstereo.github.io/,http://export.arxiv.org/abs/2405.14873,,2405.14873.pdf,Federated Online Adaptation for Deep Stereo,"We introduce a novel approach for adapting deep stereo networks in a +collaborative manner. By building over principles of federated learning, we +develop a distributed framework allowing for demanding the optimization process +to a number of clients deployed in different environments. This makes it +possible, for a deep stereo network running on resourced-constrained devices, +to capitalize on the adaptation process carried out by other instances of the +same architecture, and thus improve its accuracy in challenging environments +even when it cannot carry out adaptation on its own. Experimental results show +how federated adaptation performs equivalently to on-device adaptation, and +even better when dealing with challenging environments.",cs.CV,['cs.CV'] +Learning Transferable Negative Prompts for Out-of-Distribution Detection,Tianqi Li · Guansong Pang · wenjun miao · Xiao Bai · Jin Zheng, ,,https://paperswithcode.com/paper/learning-transferable-negative-prompts-for,,,,,nan +JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups,Simindokht Jahangard · Zhixi Cai · Shiki Wen · Hamid Rezatofighi, ,https://arxiv.org/abs/2404.04458,,2404.04458.pdf,JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups,"Understanding human social behaviour is crucial in computer vision and +robotics. Micro-level observations like individual actions fall short, +necessitating a comprehensive approach that considers individual behaviour, +intra-group dynamics, and social group levels for a thorough understanding. To +address dataset limitations, this paper introduces JRDB-Social, an extension of +JRDB. Designed to fill gaps in human understanding across diverse indoor and +outdoor social contexts, JRDB-Social provides annotations at three levels: +individual attributes, intra-group interactions, and social group context. This +dataset aims to enhance our grasp of human social dynamics for robotic +applications. 
Utilizing the recent cutting-edge multi-modal large language +models, we evaluated our benchmark to explore their capacity to decipher social +human behaviour.",cs.CV,['cs.CV'] +Region-Based Representations Revisited,Michal Shlapentokh-Rothman · Ansel Blume · Yao Xiao · Yuqun Wu · Sethuraman T V · Heyi Tao · Jae Yong Lee · Wilfredo Torres-Calderon · Yu-Xiong Wang · Derek Hoiem, ,https://arxiv.org/abs/2402.02352,,2402.02352.pdf,Region-Based Representations Revisited,"We investigate whether region-based representations are effective for +recognition. Regions were once a mainstay in recognition approaches, but pixel +and patch-based features are now used almost exclusively. We show that recent +class-agnostic segmenters like SAM can be effectively combined with strong +unsupervised representations like DINOv2 and used for a wide variety of tasks, +including semantic segmentation, object-based image retrieval, and multi-image +analysis. Once the masks and features are extracted, these representations, +even with linear decoders, enable competitive performance, making them well +suited to applications that require custom queries. The compactness of the +representation also makes it well-suited to video analysis and other problems +requiring inference across many images.",cs.CV,['cs.CV'] +CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data,Wei Fang · Yuxing Tang · Heng Guo · Mingze Yuan · Tony C. W. MOK · Ke Yan · Jiawen Yao · Xin Chen · Zaiyi Liu · Le Lu · Ling Zhang · Minfeng Xu, ,https://arxiv.org/abs/2404.04878,,2404.04878.pdf,CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data,"In the realm of medical 3D data, such as CT and MRI images, prevalent +anisotropic resolution is characterized by high intra-slice but diminished +inter-slice resolution. The lowered resolution between adjacent slices poses +challenges, hindering optimal viewing experiences and impeding the development +of robust downstream analysis algorithms. Various volumetric super-resolution +algorithms aim to surmount these challenges, enhancing inter-slice resolution +and overall 3D medical imaging quality. However, existing approaches confront +inherent challenges: 1) often tailored to specific upsampling factors, lacking +flexibility for diverse clinical scenarios; 2) newly generated slices +frequently suffer from over-smoothing, degrading fine details, and leading to +inter-slice inconsistency. In response, this study presents CycleINR, a novel +enhanced Implicit Neural Representation model for 3D medical data volumetric +super-resolution. Leveraging the continuity of the learned implicit function, +the CycleINR model can achieve results with arbitrary up-sampling rates, +eliminating the need for separate training. Additionally, we enhance the grid +sampling in CycleINR with a local attention mechanism and mitigate +over-smoothing by integrating cycle-consistent loss. We introduce a new metric, +Slice-wise Noise Level Inconsistency (SNLI), to quantitatively assess +inter-slice noise level inconsistency. 
The effectiveness of our approach is +demonstrated through image quality evaluations on an in-house dataset and a +downstream task analysis on the Medical Segmentation Decathlon liver tumor +dataset.",eess.IV,"['eess.IV', 'cs.CV']" +"Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video",Hongchi Xia · Chih-Hao Lin · Wei-Chiu Ma · Shenlong Wang, ,https://arxiv.org/abs/2404.09833v1,,2404.09833v1.pdf,"Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video","Creating high-quality and interactive virtual environments, such as games and +simulators, often involves complex and costly manual modeling processes. In +this paper, we present Video2Game, a novel approach that automatically converts +videos of real-world scenes into realistic and interactive game environments. +At the heart of our system are three core components:(i) a neural radiance +fields (NeRF) module that effectively captures the geometry and visual +appearance of the scene; (ii) a mesh module that distills the knowledge from +NeRF for faster rendering; and (iii) a physics module that models the +interactions and physical dynamics among the objects. By following the +carefully designed pipeline, one can construct an interactable and actionable +digital replica of the real world. We benchmark our system on both indoor and +large-scale outdoor scenes. We show that we can not only produce +highly-realistic renderings in real-time, but also build interactive games on +top.",cs.CV,"['cs.CV', 'cs.AI']" +Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection,Jin Yang · Ping Wei · Huan Li · Ziyang Ren, ,https://arxiv.org/abs/2404.09263,,2404.09263.pdf,Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection,"Video moment retrieval and highlight detection are two highly valuable tasks +in video understanding, but until recently they have been jointly studied. +Although existing studies have made impressive advancement recently, they +predominantly follow the data-driven bottom-up paradigm. Such paradigm +overlooks task-specific and inter-task effects, resulting in poor model +performance. In this paper, we propose a novel task-driven top-down framework +TaskWeave for joint moment retrieval and highlight detection. The framework +introduces a task-decoupled unit to capture task-specific and common +representations. To investigate the interplay between the two tasks, we propose +an inter-task feedback mechanism, which transforms the results of one task as +guiding masks to assist the other task. Different from existing methods, we +present a task-dependent joint loss function to optimize the model. +Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, +and Charades-STA datasets corroborate the effectiveness and flexibility of the +proposed framework. 
Codes are available at +https://github.com/EdenGabriel/TaskWeave.",cs.CV,"['cs.CV', 'cs.AI']" +Egocentric Full Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement,Jian Wang · Zhe Cao · Diogo Luvizon · Lingjie Liu · Kripasindhu Sarkar · Danhang Tang · Thabo Beeler · Christian Theobalt, ,https://arxiv.org/abs/2311.16495,,2311.16495.pdf,Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement,"In this work, we explore egocentric whole-body motion capture using a single +fisheye camera, which simultaneously estimates human body and hand motion. This +task presents significant challenges due to three factors: the lack of +high-quality datasets, fisheye camera distortion, and human body +self-occlusion. To address these challenges, we propose a novel approach that +leverages FisheyeViT to extract fisheye image features, which are subsequently +converted into pixel-aligned 3D heatmap representations for 3D human body pose +prediction. For hand tracking, we incorporate dedicated hand detection and hand +pose estimation networks for regressing 3D hand poses. Finally, we develop a +diffusion-based whole-body motion prior model to refine the estimated +whole-body motion while accounting for joint uncertainties. To train these +networks, we collect a large synthetic dataset, EgoWholeBody, comprising +840,000 high-quality egocentric images captured across a diverse range of +whole-body motion sequences. Quantitative and qualitative evaluations +demonstrate the effectiveness of our method in producing high-quality +whole-body motion estimates from a single egocentric camera.",cs.CV,['cs.CV'] +PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation,Xinqiao Zhao · Ziqian Yang · Tianhong Dai · Bingfeng Zhang · Jimin Xiao, ,https://arxiv.org/abs/2405.06586,,2405.06586.pdf,Enhancing Weakly Supervised Semantic Segmentation with Multi-modal Foundation Models: An End-to-End Approach,"Semantic segmentation is a core computer vision problem, but the high costs +of data annotation have hindered its wide application. Weakly-Supervised +Semantic Segmentation (WSSS) offers a cost-efficient workaround to extensive +labeling in comparison to fully-supervised methods by using partial or +incomplete labels. Existing WSSS methods have difficulties in learning the +boundaries of objects leading to poor segmentation results. We propose a novel +and effective framework that addresses these issues by leveraging visual +foundation models inside the bounding box. Adopting a two-stage WSSS framework, +our proposed network consists of a pseudo-label generation module and a +segmentation module. The first stage leverages Segment Anything Model (SAM) to +generate high-quality pseudo-labels. To alleviate the problem of delineating +precise boundaries, we adopt SAM inside the bounding box with the help of +another pre-trained foundation model (e.g., Grounding-DINO). Furthermore, we +eliminate the necessity of using the supervision of image labels, by employing +CLIP in classification. 
Then in the second stage, the generated high-quality +pseudo-labels are used to train an off-the-shelf segmenter that achieves the +state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.",cs.CV,['cs.CV'] +Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation,Haojie Zhang · Yongyi Su · Xun Xu · Kui Jia, ,https://arxiv.org/abs/2312.03502,,2312.03502.pdf,Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation,"The success of large language models has inspired the computer vision +community to explore image segmentation foundation model that is able to +zero/few-shot generalize through prompt engineering. Segment-Anything(SAM), +among others, is the state-of-the-art image segmentation foundation model +demonstrating strong zero/few-shot generalization. Despite the success, recent +studies reveal the weakness of SAM under strong distribution shift. In +particular, SAM performs awkwardly on corrupted natural images, camouflaged +images, medical images, etc. Motivated by the observations, we aim to develop a +self-training based strategy to adapt SAM to target distribution. Given the +unique challenges of large source dataset, high computation cost and incorrect +pseudo label, we propose a weakly supervised self-training architecture with +anchor regularization and low-rank finetuning to improve the robustness and +computation efficiency of adaptation. We validate the effectiveness on 5 types +of downstream segmentation tasks including natural clean/corrupted images, +medical images, camouflaged images and robotic images. Our proposed method is +task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art +domain adaptation methods on almost all downstream tasks with the same testing +prompt inputs.",cs.CV,['cs.CV'] +SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,Boyuan Chen · Zhuo Xu · Sean Kirmani · brian ichter · Dorsa Sadigh · Leonidas Guibas · Fei Xia,https://spatial-vlm.github.io/,https://arxiv.org/abs/2401.12168,,2401.12168.pdf,SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,"Understanding and reasoning about spatial relationships is a fundamental +capability for Visual Question Answering (VQA) and robotics. While Vision +Language Models (VLM) have demonstrated remarkable performance in certain VQA +benchmarks, they still lack capabilities in 3D spatial reasoning, such as +recognizing quantitative relationships of physical objects like distances or +size differences. We hypothesize that VLMs' limited spatial reasoning +capability is due to the lack of 3D spatial knowledge in training data and aim +to solve this problem by training VLMs with Internet-scale spatial reasoning +data. To this end, we present a system to facilitate this approach. We first +develop an automatic 3D spatial VQA data generation framework that scales up to +2 billion VQA examples on 10 million real-world images. We then investigate +various factors in the training recipe, including data quality, training +pipeline, and VLM architecture. Our work features the first internet-scale 3D +spatial reasoning dataset in metric space. By training a VLM on such data, we +significantly enhance its ability on both qualitative and quantitative spatial +VQA. 
Finally, we demonstrate that this VLM unlocks novel downstream +applications in chain-of-thought spatial reasoning and robotics due to its +quantitative estimation capability. Project website: +https://spatial-vlm.github.io/",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG', 'cs.RO']" +Learning to Transform Dynamically for Better Adversarial Transferability,Rongyi Zhu · Zeliang Zhang · Susan Liang · Zhuo Liu · Chenliang Xu, ,https://arxiv.org/abs/2405.14077,,2405.14077.pdf,Learning to Transform Dynamically for Better Adversarial Transferability,"Adversarial examples, crafted by adding perturbations imperceptible to +humans, can deceive neural networks. Recent studies identify the adversarial +transferability across various models, \textit{i.e.}, the cross-model attack +ability of adversarial samples. To enhance such adversarial transferability, +existing input transformation-based methods diversify input data with +transformation augmentation. However, their effectiveness is limited by the +finite number of available transformations. In our study, we introduce a novel +approach named Learning to Transform (L2T). L2T increases the diversity of +transformed images by selecting the optimal combination of operations from a +pool of candidates, consequently improving adversarial transferability. We +conceptualize the selection of optimal transformation combinations as a +trajectory optimization problem and employ a reinforcement learning strategy to +effectively solve the problem. Comprehensive experiments on the ImageNet +dataset, as well as practical tests with Google Vision and GPT-4V, reveal that +L2T surpasses current methodologies in enhancing adversarial transferability, +thereby confirming its effectiveness and practical significance. The code is +available at https://github.com/RongyiZhu/L2T.",cs.CV,"['cs.CV', 'cs.AI']" +Can’t make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,Himangi Mittal · Nakul Agarwal · Shao-Yuan Lo · Kwonjoon Lee, ,https://arxiv.org/abs/2405.20305,,2405.20305.pdf,Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,"We introduce PlausiVL, a large video-language model for anticipating action +sequences that are plausible in the real-world. While significant efforts have +been made towards anticipating future actions, prior approaches do not take +into account the aspect of plausibility in an action sequence. To address this +limitation, we explore the generative capability of a large video-language +model in our work and further, develop the understanding of plausibility in an +action sequence by introducing two objective functions, a counterfactual-based +plausible action sequence learning loss and a long-horizon action repetition +loss. We utilize temporal logical constraints as well as verb-noun action pair +logical constraints to create implausible/counterfactual action sequences and +use them to train the model with plausible action sequence learning loss. This +loss helps the model to differentiate between plausible and not plausible +action sequences and also helps the model to learn implicit temporal cues +crucial for the task of action anticipation. The long-horizon action repetition +loss puts a higher penalty on the actions that are more prone to repetition +over a longer temporal window. With this penalization, the model is able to +generate diverse, plausible action sequences. 
We evaluate our approach on two +large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the +task of action anticipation.",cs.CV,['cs.CV'] +Adapting to Length Shift: FlexiLength Network for Trajectory Prediction,Yi Xu · Yun Fu, ,https://arxiv.org/abs/2404.00742,,2404.00742.pdf,Adapting to Length Shift: FlexiLength Network for Trajectory Prediction,"Trajectory prediction plays an important role in various applications, +including autonomous driving, robotics, and scene understanding. Existing +approaches mainly focus on developing compact neural networks to increase +prediction precision on public datasets, typically employing a standardized +input duration. However, a notable issue arises when these models are evaluated +with varying observation lengths, leading to a significant performance drop, a +phenomenon we term the Observation Length Shift. To address this issue, we +introduce a general and effective framework, the FlexiLength Network (FLN), to +enhance the robustness of existing trajectory prediction techniques against +varying observation periods. Specifically, FLN integrates trajectory data with +diverse observation lengths, incorporates FlexiLength Calibration (FLC) to +acquire temporal invariant representations, and employs FlexiLength Adaptation +(FLA) to further refine these representations for more accurate future +trajectory predictions. Comprehensive experiments on multiple datasets, ie, +ETH/UCY, nuScenes, and Argoverse 1, demonstrate the effectiveness and +flexibility of our proposed FLN framework.",cs.CV,['cs.CV'] +Learning Group Activity Features Through Person Attribute Prediction,Chihiro Nakatani · Hiroaki Kawashima · Norimichi Ukita, ,https://arxiv.org/abs/2403.02753,,2403.02753.pdf,Learning Group Activity Features Through Person Attribute Prediction,"This paper proposes Group Activity Feature (GAF) learning in which features +of multi-person activity are learned as a compact latent vector. Unlike prior +work in which the manual annotation of group activities is required for +supervised learning, our method learns the GAF through person attribute +prediction without group activity annotations. By learning the whole network in +an end-to-end manner so that the GAF is required for predicting the person +attributes of people in a group, the GAF is trained as the features of +multi-person activity. As a person attribute, we propose to use a person's +action class and appearance features because the former is easy to annotate due +to its simpleness, and the latter requires no manual annotation. In addition, +we introduce a location-guided attribute prediction to disentangle the complex +GAF for extracting the features of each target person properly. Various +experimental results validate that our method outperforms SOTA methods +quantitatively and qualitatively on two public datasets. Visualization of our +GAF also demonstrates that our method learns the GAF representing fined-grained +group activity classes. 
Code: https://github.com/chihina/GAFL-CVPR2024.",cs.CV,['cs.CV'] +Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation,Xingqun Qi · Jiahao Pan · Peng Li · Ruibin Yuan · Xiaowei Chi · Mengfei Li · Wenhan Luo · Wei Xue · Shanghang Zhang · Qifeng Liu · Yike Guo, ,https://arxiv.org/abs/2311.17532,,2311.17532.pdf,Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation,"Generating vivid and emotional 3D co-speech gestures is crucial for virtual +avatar animation in human-machine interaction applications. While the existing +methods enable generating the gestures to follow a single emotion label, they +overlook that long gesture sequence modeling with emotion transition is more +practical in real scenes. In addition, the lack of large-scale available +datasets with emotional transition speech and corresponding 3D human gestures +also limits the addressing of this task. To fulfill this goal, we first +incorporate the ChatGPT-4 and an audio inpainting approach to construct the +high-fidelity emotion transition human speeches. Considering obtaining the +realistic 3D pose annotations corresponding to the dynamically inpainted +emotion transition audio is extremely difficult, we propose a novel weakly +supervised training strategy to encourage authority gesture transitions. +Specifically, to enhance the coordination of transition gestures w.r.t +different emotional ones, we model the temporal association representation +between two different emotional gesture sequences as style guidance and infuse +it into the transition generation. We further devise an emotion mixture +mechanism that provides weak supervision based on a learnable mixed emotion +label for transition gestures. Last, we present a keyframe sampler to supply +effective initial posture cues in long sequences, enabling us to generate +diverse gestures. Extensive experiments demonstrate that our method outperforms +the state-of-the-art models constructed by adapting single emotion-conditioned +counterparts on our newly defined emotion transition task and datasets. Our +code and dataset will be released on the project page: +https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.",cs.CV,['cs.CV'] +FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders,Soumen Basu · Mayuna Gupta · Chetan Madan · Pankaj Gupta · Chetan Arora,https://gbc-iitd.github.io/focusmae,https://arxiv.org/abs/2403.08848,,2403.08848.pdf,FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders,"In recent years, automated Gallbladder Cancer (GBC) detection has gained the +attention of researchers. Current state-of-the-art (SOTA) methodologies relying +on ultrasound sonography (US) images exhibit limited generalization, +emphasizing the need for transformative approaches. We observe that individual +US frames may lack sufficient information to capture disease manifestation. +This study advocates for a paradigm shift towards video-based GBC detection, +leveraging the inherent advantages of spatiotemporal representations. Employing +the Masked Autoencoder (MAE) for representation learning, we address +shortcomings in conventional image-based methods. We propose a novel design +called FocusMAE to systematically bias the selection of masking tokens from +high-information regions, fostering a more refined representation of +malignancy. Additionally, we contribute the most extensive US video dataset for +GBC detection. 
We also note that, this is the first study on US video-based GBC +detection. We validate the proposed methods on the curated dataset, and report +a new state-of-the-art (SOTA) accuracy of 96.4% for the GBC detection problem, +against an accuracy of 84% by current Image-based SOTA - GBCNet, and RadFormer, +and 94.7% by Video-based SOTA - AdaMAE. We further demonstrate the generality +of the proposed FocusMAE on a public CT-based Covid detection dataset, +reporting an improvement in accuracy by 3.3% over current baselines. The source +code and pretrained models are available at: +https://gbc-iitd.github.io/focusmae",eess.IV,"['eess.IV', 'cs.CV']" +Learning to Predict Activity Progress by Self-Supervised Video Alignment,Gerard Donahue · Ehsan Elhamifar, ,https://arxiv.org/abs/2405.15160,,2405.15160.pdf,ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning,"This paper presents a new self-supervised video representation learning +framework, ARVideo, which autoregressively predicts the next video token in a +tailored sequence order. Two key designs are included. First, we organize +autoregressive video tokens into clusters that span both spatially and +temporally, thereby enabling a richer aggregation of contextual information +compared to the standard spatial-only or temporal-only clusters. Second, we +adopt a randomized spatiotemporal prediction order to facilitate learning from +multi-dimensional data, addressing the limitations of a handcrafted +spatial-first or temporal-first sequence order. Extensive experiments establish +ARVideo as an effective paradigm for self-supervised video representation +learning. For example, when trained with the ViT-B backbone, ARVideo +competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something +V2, which are on par with the strong benchmark set by VideoMAE. Importantly, +ARVideo also demonstrates higher training efficiency, i.e., it trains 14% +faster and requires 58% less GPU memory compared to VideoMAE.",cs.CV,['cs.CV'] +Revisiting Global Translation Estimation with Feature Tracks,Peilin Tao · Hainan Cui · Mengqi Rong · Shuhan Shen, ,https://arxiv.org/abs/2403.14118,,2403.14118.pdf,From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation,"Machine Translation Quality Estimation (MTQE) is the task of estimating the +quality of machine-translated text in real time without the need for reference +translations, which is of great importance for the development of MT. After two +decades of evolution, QE has yielded a wealth of results. This article provides +a comprehensive overview of QE datasets, annotation methods, shared tasks, +methodologies, challenges, and future research directions. It begins with an +introduction to the background and significance of QE, followed by an +explanation of the concepts and evaluation metrics for word-level QE, +sentence-level QE, document-level QE, and explainable QE. The paper categorizes +the methods developed throughout the history of QE into those based on +handcrafted features, deep learning, and Large Language Models (LLMs), with a +further division of deep learning-based methods into classic deep learning and +those incorporating pre-trained language models (LMs). Additionally, the +article details the advantages and limitations of each method and offers a +straightforward comparison of different approaches. 
Finally, the paper +discusses the current challenges in QE research and provides an outlook on +future research directions.",cs.CL,['cs.CL'] +Directed Decentralized Collaboration for Personalized Federated Learning,Yingqi Liu · Yifan Shi · Qinglun Li · Baoyuan Wu · Xueqian Wang · Li Shen, ,https://arxiv.org/abs/2405.17876,,2405.17876.pdf,Decentralized Directed Collaboration for Personalized Federated Learning,"Personalized Federated Learning (PFL) is proposed to find the greatest +personalized models for each client. To avoid the central failure and +communication bottleneck in the server-based FL, we concentrate on the +Decentralized Personalized Federated Learning (DPFL) that performs distributed +model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL +are based on undirected and symmetric topologies, however, the data, +computation and communication resources heterogeneity result in large variances +in the personalized models, which lead the undirected aggregation to suboptimal +personalized performance and unguaranteed convergence. To address these issues, +we propose a directed collaboration DPFL framework by incorporating stochastic +gradient push and partial model personalized, called \textbf{D}ecentralized +\textbf{Fed}erated \textbf{P}artial \textbf{G}radient \textbf{P}ush +(\textbf{DFedPGP}). It personalizes the linear classifier in the modern deep +model to customize the local solution and learns a consensus representation in +a fully decentralized manner. Clients only share gradients with a subset of +neighbors based on the directed and asymmetric topologies, which guarantees +flexible choices for resource efficiency and better convergence. Theoretically, +we show that the proposed DFedPGP achieves a superior convergence rate of +$\mathcal{O}(\frac{1}{\sqrt{T}})$ in the general non-convex setting, and prove +the tighter connectivity among clients will speed up the convergence. The +proposed method achieves state-of-the-art (SOTA) accuracy in both data and +computation heterogeneity scenarios, demonstrating the efficiency of the +directed collaboration and partial gradient push.",cs.LG,"['cs.LG', 'cs.DC', 'math.OC']" +Towards Calibrated Multi-label Deep Neural Networks,Jiacheng Cheng · Nuno Vasconcelos, ,,https://paperswithcode.com/paper/towards-calibrated-deep-clustering-network,,,,,nan +PolarRec: Improving Radio Interferometric Data Reconstruction Using Polar Coordinates,Ruoqi Wang · Zhuoyang Chen · Jiayi Zhu · Qiong Luo · Feng Wang, ,https://arxiv.org/abs/2308.14610,,2308.14610.pdf,PolarRec: Radio Interferometric Data Reconstruction with Polar Coordinate Representation,"In radio astronomy, visibility data, which are measurements of wave signals +from radio telescopes, are transformed into images for observation of distant +celestial objects. However, these resultant images usually contain both real +sources and artifacts, due to signal sparsity and other factors. One way to +obtain cleaner images is to reconstruct samples into dense forms before +imaging. Unfortunately, existing reconstruction methods often miss some +components of visibility in frequency domain, so blurred object edges and +persistent artifacts remain in the images. Furthermore, the computation +overhead is high on irregular visibility samples due to the data skew. To +address these problems, we propose PolarRec, a transformer-encoder-conditioned +reconstruction pipeline with visibility samples converted into the polar +coordinate representation. 
This representation matches the way in which radio +telescopes observe a celestial area as the Earth rotates. As a result, +visibility samples distribute in the polar system more uniformly than in the +Cartesian space. Therefore, we propose to use radial distance in the loss +function, to help reconstruct complete visibility effectively. Also, we group +visibility samples by their polar angles and propose a group-based encoding +scheme to improve the efficiency. Our experiments demonstrate that PolarRec +markedly improves imaging results by faithfully reconstructing all frequency +components in the visibility domain while significantly reducing the +computation cost in visibility data encoding. We believe this high-quality and +high-efficiency imaging of PolarRec will better facilitate astronomers to +conduct their research.",astro-ph.IM,"['astro-ph.IM', 'cs.AI', 'cs.CV']" +SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution,Rongyuan Wu · Tao Yang · Lingchen Sun · Zhengqiang ZHANG · Shuai Li · Lei Zhang, ,https://arxiv.org/abs/2311.16518,,2311.16518.pdf,SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution,"Owe to the powerful generative priors, the pre-trained text-to-image (T2I) +diffusion models have become increasingly popular in solving the real-world +image super-resolution problem. However, as a consequence of the heavy quality +degradation of input low-resolution (LR) images, the destruction of local +structures can lead to ambiguous image semantics. As a result, the content of +reproduced high-resolution image may have semantic errors, deteriorating the +super-resolution performance. To address this issue, we present a +semantics-aware approach to better preserve the semantic fidelity of generative +real-world image super-resolution. First, we train a degradation-aware prompt +extractor, which can generate accurate soft and hard semantic prompts even +under strong degradation. The hard semantic prompts refer to the image tags, +aiming to enhance the local perception ability of the T2I model, while the soft +semantic prompts compensate for the hard ones to provide additional +representation information. These semantic prompts can encourage the T2I model +to generate detailed and semantically accurate results. Furthermore, during the +inference process, we integrate the LR images into the initial sampling noise +to mitigate the diffusion model's tendency to generate excessive random +details. The experiments show that our method can reproduce more realistic +image details and hold better the semantics.",cs.CV,['cs.CV'] +PanoContext-Former: Panoramic Total Scene Understanding with a Transformer,Yuan Dong · Chuan Fang · Liefeng Bo · Zilong Dong · Ping Tan,https://fangchuan.github.io/PanoContext-Former/,https://arxiv.org/abs/2312.07378v1,,2312.07378v1.pdf,X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer,"The field of 4D point cloud understanding is rapidly developing with the goal +of analyzing dynamic 3D point cloud sequences. However, it remains a +challenging task due to the sparsity and lack of texture in point clouds. +Moreover, the irregularity of point cloud poses a difficulty in aligning +temporal information within video sequences. To address these issues, we +propose a novel cross-modal knowledge transfer framework, called +X4D-SceneFormer. 
This framework enhances 4D-Scene understanding by transferring +texture priors from RGB sequences using a Transformer architecture with +temporal relationship mining. Specifically, the framework is designed with a +dual-branch architecture, consisting of an 4D point cloud transformer and a +Gradient-aware Image Transformer (GIT). During training, we employ multiple +knowledge transfer techniques, including temporal consistency losses and masked +self-attention, to strengthen the knowledge transfer between modalities. This +leads to enhanced performance during inference using single-modal 4D point +cloud inputs. Extensive experiments demonstrate the superior performance of our +framework on various 4D point cloud video understanding tasks, including action +recognition, action segmentation and semantic segmentation. The results achieve +1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action +segmentation and semantic segmentation, on the HOI4D +challenge\footnote{\url{http://www.hoi4d.top/}.}, outperforming previous +state-of-the-art by a large margin. We release the code at +https://github.com/jinglinglingling/X4D",cs.CV,['cs.CV'] +Text-image Alignment for Diffusion-based Perception,Neehar Kondapaneni · Markus Marks · Manuel Knott · Rogério Guimarães · Pietro Perona,https://www.vision.caltech.edu/tadp/,https://arxiv.org/abs/2310.00031,,2310.00031.pdf,Text-image Alignment for Diffusion-based Perception,"Diffusion models are generative models with impressive text-to-image +synthesis capabilities and have spurred a new wave of creative methods for +classical machine learning tasks. However, the best way to harness the +perceptual knowledge of these generative models for visual tasks is still an +open question. Specifically, it is unclear how to use the prompting interface +when applying diffusion backbones to vision tasks. We find that automatically +generated captions can improve text-image alignment and significantly enhance a +model's cross-attention maps, leading to better perceptual performance. Our +approach improves upon the current state-of-the-art (SOTA) in diffusion-based +semantic segmentation on ADE20K and the current overall SOTA for depth +estimation on NYUv2. Furthermore, our method generalizes to the cross-domain +setting. We use model personalization and caption modifications to align our +model to the target domain and find improvements over unaligned baselines. Our +cross-domain object detection model, trained on Pascal VOC, achieves SOTA +results on Watercolor2K. Our cross-domain segmentation method, trained on +Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. +Project page: https://www.vision.caltech.edu/tadp/. Code: +https://github.com/damaggu/TADP.",cs.CV,['cs.CV'] +DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement,Jiuming Liu · Guangming Wang · Weicai Ye · Chaokang Jiang · Jinru Han · Zhe Liu · Guofeng Zhang · Dalong Du · Hesheng Wang, ,https://arxiv.org/abs/2311.17456,,2311.17456.pdf,DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model,"Scene flow estimation, which aims to predict per-point 3D displacements of +dynamic scenes, is a fundamental task in the computer vision field. However, +previous works commonly suffer from unreliable correlation caused by locally +constrained searching ranges, and struggle with accumulated inaccuracy arising +from the coarse-to-fine structure. 
To alleviate these problems, we propose a +novel uncertainty-aware scene flow estimation network (DifFlow3D) with the +diffusion probabilistic model. Iterative diffusion-based refinement is designed +to enhance the correlation robustness and resilience to challenging cases, e.g. +dynamics, noisy inputs, repetitive patterns, etc. To restrain the generation +diversity, three key flow-related features are leveraged as conditions in our +diffusion model. Furthermore, we also develop an uncertainty estimation module +within diffusion to evaluate the reliability of estimated scene flow. Our +DifFlow3D achieves state-of-the-art performance, with 24.0% and 29.1% EPE3D +reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably, our +method achieves an unprecedented millimeter-level accuracy (0.0078m in EPE3D) +on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can +be readily integrated as a plug-and-play module into existing scene flow +networks, significantly increasing their estimation accuracy. Codes are +released at https://github.com/IRMVLab/DifFlow3D.",cs.CV,['cs.CV'] +Mind Artist: Creating Artistic Snapshots with Human Thought,Jiaxuan Chen · Yu Qi · Yueming Wang · Gang Pan, ,https://ar5iv.labs.arxiv.org/html/2309.15729,,2309.15729.pdf,MindGPT: Interpreting What You See with Non-invasive Brain Recordings,"Decoding of seen visual contents with non-invasive brain recordings has +important scientific and practical values. Efforts have been made to recover +the seen images from brain signals. However, most existing approaches cannot +faithfully reflect the visual contents due to insufficient image quality or +semantic mismatches. Compared with reconstructing pixel-level visual images, +speaking is a more efficient and effective way to explain visual information. +Here we introduce a non-invasive neural decoder, termed as MindGPT, which +interprets perceived visual stimuli into natural languages from fMRI signals. +Specifically, our model builds upon a visually guided neural encoder with a +cross-attention mechanism, which permits us to guide latent neural +representations towards a desired language semantic direction in an end-to-end +manner by the collaborative use of the large language model GPT. By doing so, +we found that the neural representations of the MindGPT are explainable, which +can be used to evaluate the contributions of visual properties to language +semantics. Our experiments show that the generated word sequences truthfully +represented the visual information (with essential details) conveyed in the +seen stimuli. The results also suggested that with respect to language decoding +tasks, the higher visual cortex (HVC) is more semantically informative than the +lower visual cortex (LVC), and using only the HVC can recover most of the +semantic information. 
The code of the MindGPT model will be publicly available +at https://github.com/JxuanC/MindGPT.",cs.CV,"['cs.CV', 'cs.AI']" +Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness,Sibo Wang · Jie Zhang · Zheng Yuan · Shiguang Shan,https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness,https://arxiv.org/html/2401.04350v3,,2401.04350v3.pdf,Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness,"Large-scale pre-trained vision-language models like CLIP have demonstrated +impressive performance across various tasks, and exhibit remarkable zero-shot +generalization capability, while they are also vulnerable to imperceptible +adversarial examples. Existing works typically employ adversarial training +(fine-tuning) as a defense method against adversarial examples. However, direct +application to the CLIP model may result in overfitting, compromising the +model's capacity for generalization. In this paper, we propose Pre-trained +Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages +supervision from the original pre-trained model by carefully designing an +auxiliary branch, to enhance the model's zero-shot adversarial robustness. +Specifically, PMG-AFT minimizes the distance between the features of +adversarial examples in the target model and those in the pre-trained model, +aiming to preserve the generalization features already captured by the +pre-trained model. Extensive Experiments on 15 zero-shot datasets demonstrate +that PMG-AFT significantly outperforms the state-of-the-art method, improving +the top-1 robust accuracy by an average of 4.99%. Furthermore, our approach +consistently improves clean accuracy by an average of 8.72%. Our code is +available at +https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness.",cs.CV,['cs.CV'] +ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image,Marco Pesavento · Yuanlu Xu · Nikolaos Sarafianos · Robert Maier · Ziyan Wang · Chun-Han Yao · Marco Volino · Edmond Boyer · Adrian Hilton · Tony Tung, ,https://arxiv.org/abs/2403.10357,,2403.10357.pdf,ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image,"Recent progress in human shape learning, shows that neural implicit models +are effective in generating 3D human surfaces from limited number of views, and +even from a single RGB image. However, existing monocular approaches still +struggle to recover fine geometric details such as face, hands or cloth +wrinkles. They are also easily prone to depth ambiguities that result in +distorted geometries along the camera optical axis. In this paper, we explore +the benefits of incorporating depth observations in the reconstruction process +by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes +from single-view RGB-D images with an unprecedented level of accuracy. Our +model learns geometric details from both multi-resolution pixel-aligned and +voxel-aligned features to leverage depth information and enable spatial +relationships, mitigating depth ambiguities. We further enhance the quality of +the reconstructed shape by introducing a depth-supervision strategy, which +improves the accuracy of the signed distance field estimation of points that +lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms +state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data +as input. 
In addition, we introduce ANIM-Real, a new multi-modal dataset +comprising high-quality scans paired with consumer-grade RGB-D camera, and our +protocol to fine-tune ANIM, enabling high-quality reconstruction from +real-world human capture.",cs.CV,"['cs.CV', 'cs.GR']" +GLOW: Global Layout Aware Attacks on Object Detection,Jun Bao · Buyu Liu · Kui Ren · Jun Yu, ,,https://paperswithcode.com/search?q=author:Jun+Yu,,,,,nan +ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention,Jiawei Wang · Changjian Li,https://enigma-li.github.io/projects/contextSeg/contextSeg.html,https://arxiv.org/abs/2311.16682,,2311.16682.pdf,ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention,"Sketch semantic segmentation is a well-explored and pivotal problem in +computer vision involving the assignment of pre-defined part labels to +individual strokes. This paper presents ContextSeg - a simple yet highly +effective approach to tackling this problem with two stages. In the first +stage, to better encode the shape and positional information of strokes, we +propose to predict an extra dense distance field in an autoencoder network to +reinforce structural information learning. In the second stage, we treat an +entire stroke as a single entity and label a group of strokes within the same +semantic part using an auto-regressive Transformer with the default attention +mechanism. By group-based labeling, our method can fully leverage the context +information when making decisions for the remaining groups of strokes. Our +method achieves the best segmentation accuracy compared with state-of-the-art +approaches on two representative datasets and has been extensively evaluated +demonstrating its superior performance. Additionally, we offer insights into +solving part imbalance in training data and the preliminary experiment on +cross-category training, which can inspire future research in this field.",cs.CV,"['cs.CV', 'cs.GR']" +GEARS: Local Geometry-aware Hand-object Interaction Synthesis,Keyang Zhou · Bharat Lal Bhatnagar · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2404.01758,,2404.01758.pdf,GEARS: Local Geometry-aware Hand-object Interaction Synthesis,"Generating realistic hand motion sequences in interaction with objects has +gained increasing attention with the growing interest in digital humans. Prior +work has illustrated the effectiveness of employing occupancy-based or +distance-based virtual sensors to extract hand-object interaction features. +Nonetheless, these methods show limited generalizability across object +categories, shapes and sizes. We hypothesize that this is due to two reasons: +1) the limited expressiveness of employed virtual sensors, and 2) scarcity of +available training data. To tackle this challenge, we introduce a novel +joint-centered sensor designed to reason about local object geometry near +potential interaction regions. The sensor queries for object surface points in +the neighbourhood of each hand joint. As an important step towards mitigating +the learning complexity, we transform the points from global frame to hand +template frame and use a shared module to process sensor features of each +individual joint. This is followed by a spatio-temporal transformer network +aimed at capturing correlation among the joints in different dimensions. +Moreover, we devise simple heuristic rules to augment the limited training +sequences with vast static hand grasping samples. 
This leads to a broader +spectrum of grasping types observed during training, in turn enhancing our +model's generalization capability. We evaluate on two public datasets, GRAB and +InterCap, where our method shows superiority over baselines both quantitatively +and perceptually.",cs.CV,['cs.CV'] +Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts,Cansu Korkmaz · Ahmet Murat Tekalp · Zafer Dogan,https://github.com/mandalinadagi/WGSR,,https://paperswithcode.com/paper/training-generative-image-super-resolution,,,,,nan +OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning,Geng Xinyu · Jiaming Wang · Jiawei Gong · yuerong xue · Jun Xu · Fanglin Chen · Xiaolin Huang, ,https://arxiv.org/abs/2403.13351v1,,2403.13351v1.pdf,OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning,"Redundancy is a persistent challenge in Capsule Networks (CapsNet),leading to +high computational costs and parameter counts. Although previous works have +introduced pruning after the initial capsule layer, dynamic routing's fully +connected nature and non-orthogonal weight matrices reintroduce redundancy in +deeper layers. Besides, dynamic routing requires iterating to converge, further +increasing computational demands. In this paper, we propose an Orthogonal +Capsule Network (OrthCaps) to reduce redundancy, improve routing performance +and decrease parameter counts. Firstly, an efficient pruned capsule layer is +introduced to discard redundant capsules. Secondly, dynamic routing is replaced +with orthogonal sparse attention routing, eliminating the need for iterations +and fully connected structures. Lastly, weight matrices during routing are +orthogonalized to sustain low capsule similarity, which is the first approach +to introduce orthogonality into CapsNet as far as we know. Our experiments on +baseline datasets affirm the efficiency and robustness of OrthCaps in +classification tasks, in which ablation studies validate the criticality of +each component. Remarkably, OrthCaps-Shallow outperforms other Capsule Network +benchmarks on four datasets, utilizing only 110k parameters, which is a mere +1.25% of a standard Capsule Network's total. To the best of our knowledge, it +achieves the smallest parameter count among existing Capsule Networks. +Similarly, OrthCaps-Deep demonstrates competitive performance across four +datasets, utilizing only 1.2% of the parameters required by its counterparts.",cs.CV,['cs.CV'] +Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping,Alex Costanzino · Pierluigi Zama Ramirez · Giuseppe Lisanti · Luigi Di Stefano,https://cvlab-unibo.github.io/CrossmodalFeatureMapping/,https://arxiv.org/abs/2312.04521,,2312.04521.pdf,Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping,"The paper explores the industrial multimodal Anomaly Detection (AD) task, +which exploits point clouds and RGB images to localize anomalies. We introduce +a novel light and fast framework that learns to map features from one modality +to the other on nominal samples. At test time, anomalies are detected by +pinpointing inconsistencies between observed and mapped features. Extensive +experiments show that our approach achieves state-of-the-art detection and +segmentation performance in both the standard and few-shot settings on the +MVTec 3D-AD dataset while achieving faster inference and occupying less memory +than previous multimodal AD methods. 
Moreover, we propose a layer-pruning +technique to improve memory and time efficiency with a marginal sacrifice in +performance.",cs.CV,['cs.CV'] +Discover and Mitigate Multiple Biased Subgroups in Image Classifiers,Zeliang Zhang · Mingqian Feng · Zhiheng Li · Chenliang Xu, ,https://arxiv.org/abs/2403.12777,,2403.12777.pdf,Discover and Mitigate Multiple Biased Subgroups in Image Classifiers,"Machine learning models can perform well on in-distribution data but often +fail on biased subgroups that are underrepresented in the training data, +hindering the robustness of models for reliable applications. Such subgroups +are typically unknown due to the absence of subgroup labels. Discovering biased +subgroups is the key to understanding models' failure modes and further +improving models' robustness. Most previous works of subgroup discovery make an +implicit assumption that models only underperform on a single biased subgroup, +which does not hold on in-the-wild data where multiple biased subgroups exist. + In this work, we propose Decomposition, Interpretation, and Mitigation (DIM), +a novel method to address a more challenging but also more practical problem of +discovering multiple biased subgroups in image classifiers. Our approach +decomposes the image features into multiple components that represent multiple +subgroups. This decomposition is achieved via a bilinear dimension reduction +method, Partial Least Square (PLS), guided by useful supervision from the image +classifier. We further interpret the semantic meaning of each subgroup +component by generating natural language descriptions using vision-language +foundation models. Finally, DIM mitigates multiple biased subgroups +simultaneously via two strategies, including the data- and model-centric +strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate +the effectiveness of DIM in discovering and mitigating multiple biased +subgroups. Furthermore, DIM uncovers the failure modes of the classifier on +Hard ImageNet, showcasing its broader applicability to understanding model bias +in image classifiers. The code is available at +https://github.com/ZhangAIPI/DIM.",cs.CV,"['cs.CV', 'cs.AI']" +RMT: Retentive Networks Meet Vision Transformers,Qihang Fan · Huaibo Huang · Mingrui Chen · Hongmin Liu · Ran He,https://github.com/qhfan/RMT,https://arxiv.org/abs/2309.11523,,2309.11523.pdf,RMT: Retentive Networks Meet Vision Transformers,"Vision Transformer (ViT) has gained increasing attention in the computer +vision community in recent years. However, the core component of ViT, +Self-Attention, lacks explicit spatial priors and bears a quadratic +computational complexity, thereby constraining the applicability of ViT. To +alleviate these issues, we draw inspiration from the recent Retentive Network +(RetNet) in the field of NLP, and propose RMT, a strong vision backbone with +explicit spatial prior for general purposes. Specifically, we extend the +RetNet's temporal decay mechanism to the spatial domain, and propose a spatial +decay matrix based on the Manhattan distance to introduce the explicit spatial +prior to Self-Attention. Additionally, an attention decomposition form that +adeptly adapts to explicit spatial prior is proposed, aiming to reduce the +computational burden of modeling global information without disrupting the +spatial decay matrix. Based on the spatial decay matrix and the attention +decomposition form, we can flexibly integrate explicit spatial prior into the +vision backbone with linear complexity. 
Extensive experiments demonstrate that +RMT exhibits exceptional performance across various vision tasks. Specifically, +without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on +ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream +tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection +task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is +available at https://github.com/qhfan/RMT",cs.CV,['cs.CV'] +No More Ambiguity in 360$^\circ$ Room Layout via Bi-Layout Estimation,Yu-Ju Tsai · Jin-Cheng Jhang · JINGJING ZHENG · Wei Wang · Albert Chen · Min Sun · Cheng-Hao Kuo · Ming-Hsuan Yang, ,https://arxiv.org/abs/2404.09993,,2404.09993.pdf,No More Ambiguity in 360° Room Layout via Bi-Layout Estimation,"Inherent ambiguity in layout annotations poses significant challenges to +developing accurate 360{\deg} room layout estimation models. To address this +issue, we propose a novel Bi-Layout model capable of predicting two distinct +layout types. One stops at ambiguous regions, while the other extends to +encompass all visible areas. Our model employs two global context embeddings, +where each embedding is designed to capture specific contextual information for +each layout type. With our novel feature guidance module, the image feature +retrieves relevant context from these embeddings, generating layout-aware +features for precise bi-layout predictions. A unique property of our Bi-Layout +model is its ability to inherently detect ambiguous regions by comparing the +two predictions. To circumvent the need for manual correction of ambiguous +annotations during testing, we also introduce a new metric for disambiguating +ground truth layouts. Our method demonstrates superior performance on benchmark +datasets, notably outperforming leading approaches. Specifically, on the +MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the +full test set and notably from 54.80% to 59.97% in subsets with significant +ambiguity. Project page: https://liagm.github.io/Bi_Layout/",cs.CV,['cs.CV'] +AVID: Any-Length Video Inpainting with Diffusion Model,Zhixing Zhang · Bichen Wu · Xiaoyan Wang · Yaqiao Luo · Luxin Zhang · Yinan Zhao · Peter Vajda · Dimitris N. Metaxas · Licheng Yu,https://zhang-zx.github.io/AVID/,https://arxiv.org/abs/2312.03816,,2312.03816.pdf,AVID: Any-Length Video Inpainting with Diffusion Model,"Recent advances in diffusion models have successfully enabled text-guided +image inpainting. While it seems straightforward to extend such editing +capability into the video domain, there have been fewer works regarding +text-guided video inpainting. Given a video, a masked region at its initial +frame, and an editing prompt, it requires a model to do infilling at each frame +following the editing guidance while keeping the out-of-mask region intact. +There are three main challenges in text-guided video inpainting: ($i$) temporal +consistency of the edited video, ($ii$) supporting different inpainting types +at different structural fidelity levels, and ($iii$) dealing with variable +video length. To address these challenges, we introduce Any-Length Video +Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is +equipped with effective motion modules and adjustable structure guidance, for +fixed-length video inpainting. 
Building on top of that, we propose a novel +Temporal MultiDiffusion sampling pipeline with a middle-frame attention +guidance mechanism, facilitating the generation of videos with any desired +duration. Our comprehensive experiments show our model can robustly deal with +various inpainting types at different video duration ranges, with high quality. +More visualization results are made publicly available at +https://zhang-zx.github.io/AVID/ .",cs.CV,['cs.CV'] +PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference,Xiao Tang · Min Yang · Penghui Sun · Hui Li · Yuchao Dai · feng zhu · Hojae Lee, ,https://arxiv.org/abs/2405.08609,,2405.08609.pdf,Dynamic NeRF: A Review,"Neural Radiance Field(NeRF) is an novel implicit method to achieve the 3D +reconstruction and representation with a high resolution. After the first +research of NeRF is proposed, NeRF has gained a robust developing power and is +booming in the 3D modeling, representation and reconstruction areas. However +the first and most of the followed research projects based on NeRF is static, +which are weak in the practical applications. Therefore, more researcher are +interested and focused on the study of dynamic NeRF that is more feasible and +useful in practical applications or situations. Compared with the static NeRF, +implementing the Dynamic NeRF is more difficult and complex. But Dynamic is +more potential in the future even is the basic of Editable NeRF. In this +review, we made a detailed and abundant statement for the development and +important implementation principles of Dynamci NeRF. The analysis of main +principle and development of Dynamic NeRF is from 2021 to 2023, including the +most of the Dynamic NeRF projects. What is more, with colorful and novel +special designed figures and table, We also made a detailed comparison and +analysis of different features of various of Dynamic. Besides, we analyzed and +discussed the key methods to implement a Dynamic NeRF. The volume of the +reference papers is large. The statements and comparisons are multidimensional. +With a reading of this review, the whole development history and most of the +main design method or principles of Dynamic NeRF can be easy understood and +gained.",cs.CV,['cs.CV'] +LoCoNet: Long-Short Context Network for Active Speaker Detection,Xizi Wang · Feng Cheng · Gedas Bertasius, ,https://ar5iv.labs.arxiv.org/html/2301.08237,,2301.08237.pdf,LoCoNet: Long-Short Context Network for Active Speaker Detection,"Active Speaker Detection (ASD) aims to identify who is speaking in each frame +of a video. ASD reasons from audio and visual information from two contexts: +long-term intra-speaker context and short-term inter-speaker context. Long-term +intra-speaker context models the temporal dependencies of the same speaker, +while short-term inter-speaker context models the interactions of speakers in +the same scene. These two contexts are complementary to each other and can help +infer the active speaker. Motivated by these observations, we propose LoCoNet, +a simple yet effective Long-Short Context Network that models the long-term +intra-speaker context and short-term inter-speaker context. We use +self-attention to model long-term intra-speaker context due to its +effectiveness in modeling long-range dependencies, and convolutional blocks +that capture local patterns to model short-term inter-speaker context. 
+Extensive experiments show that LoCoNet achieves state-of-the-art performance +on multiple datasets, achieving an mAP of 95.2%(+1.1%) on AVA-ActiveSpeaker, +68.1%(+22%) on Columbia dataset, 97.2%(+2.8%) on Talkies dataset and +59.7%(+8.0%) on Ego4D dataset. Moreover, in challenging cases where multiple +speakers are present, or face of active speaker is much smaller than other +faces in the same scene, LoCoNet outperforms previous state-of-the-art methods +by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at +https://github.com/SJTUwxz/LoCoNet_ASD.",cs.CV,['cs.CV'] +Realigning Confidence with Temporal Saliency Information for Point-Level Weakly-Supervised Temporal Action Localization,Ziying Xia · Jian Cheng · Siyu Liu · Yongxiang Hu · Shiguang Wang · Zhang Yijie · Wanli Dang,https://github.com/zyxia1009/CVPR2024-TSPNet,,https://link.springer.com/article/10.1007/s11063-024-11598-w,,,,,nan +3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos,Jiakai Sun · Han Jiao · Guangyuan Li · Zhanjie Zhang · Lei Zhao · Wei Xing,https://sjojok.github.io/3dgstream/,https://arxiv.org/abs/2403.01444,,2403.01444.pdf,3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos,"Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes +from multi-view videos remains a challenging endeavor. Despite the remarkable +advancements achieved by current neural rendering techniques, these methods +generally require complete video sequences for offline training and are not +capable of real-time rendering. To address these constraints, we introduce +3DGStream, a method designed for efficient FVV streaming of real-world dynamic +scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12 +seconds and real-time rendering at 200 FPS. Specifically, we utilize 3D +Gaussians (3DGs) to represent the scene. Instead of the na\""ive approach of +directly optimizing 3DGs per-frame, we employ a compact Neural Transformation +Cache (NTC) to model the translations and rotations of 3DGs, markedly reducing +the training time and storage required for each FVV frame. Furthermore, we +propose an adaptive 3DG addition strategy to handle emerging objects in dynamic +scenes. Experiments demonstrate that 3DGStream achieves competitive performance +in terms of rendering speed, image quality, training time, and model storage +when compared with state-of-the-art methods.",cs.CV,['cs.CV'] +Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery,Yuqi Zhang · Guanying Chen · Jiaxing Chen · Shuguang Cui,https://zyqz97.github.io/Aerial_Lifting/,https://arxiv.org/abs/2403.11812,,2403.11812.pdf,Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery,"We present a neural radiance field method for urban-scale semantic and +building-level instance segmentation from aerial images by lifting noisy 2D +labels to 3D. This is a challenging problem due to two primary reasons. +Firstly, objects in urban aerial images exhibit substantial variations in size, +including buildings, cars, and roads, which pose a significant challenge for +accurate 2D segmentation. Secondly, the 2D labels generated by existing +segmentation methods suffer from the multi-view inconsistency problem, +especially in the case of aerial images, where each image captures only a small +portion of the entire scene. 
To overcome these limitations, we first introduce +a scale-adaptive semantic label fusion strategy that enhances the segmentation +of objects of varying sizes by combining labels predicted from different +altitudes, harnessing the novel-view synthesis capabilities of NeRF. We then +introduce a novel cross-view instance label grouping strategy based on the 3D +scene representation to mitigate the multi-view inconsistency problem in the 2D +instance labels. Furthermore, we exploit multi-view reconstructed depth priors +to improve the geometric quality of the reconstructed radiance field, resulting +in enhanced segmentation results. Experiments on multiple real-world +urban-scale datasets demonstrate that our approach outperforms existing +methods, highlighting its effectiveness.",cs.CV,['cs.CV'] +NetTrack: Tracking Highly Dynamic Objects with a Net,Guangze Zheng · Shijie Lin · Haobo Zuo · Changhong Fu · Jia Pan, ,https://arxiv.org/abs/2403.11186,,2403.11186.pdf,NetTrack: Tracking Highly Dynamic Objects with a Net,"The complex dynamicity of open-world objects presents non-negligible +challenges for multi-object tracking (MOT), often manifested as severe +deformations, fast motion, and occlusions. Most methods that solely depend on +coarse-grained object cues, such as boxes and the overall appearance of the +object, are susceptible to degradation due to distorted internal relationships +of dynamic objects. To address this problem, this work proposes NetTrack, an +efficient, generic, and affordable tracking framework to introduce fine-grained +learning that is robust to dynamicity. Specifically, NetTrack constructs a +dynamicity-aware association with a fine-grained Net, leveraging point-level +visual cues. Correspondingly, a fine-grained sampler and matching method have +been incorporated. Furthermore, NetTrack learns object-text correspondence for +fine-grained localization. To evaluate MOT in extremely dynamic open-world +scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits +high dynamicity with diverse species and open-world scenarios. Comprehensive +evaluation on BFT validates the effectiveness of fine-grained learning on +object dynamicity, and thorough transfer experiments on challenging open-world +benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong +generalization ability of NetTrack even without finetuning. Project page: +https://george-zhuang.github.io/nettrack/.",cs.CV,['cs.CV'] +"Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks",Bowen Deng · Siyang Song · Andrew French · Denis Schluppeck · Michael Pound, ,,https://github.com/topics/saliency-ranking-dateset,,,,,nan +Breathing Life Into Sketches Using Text-to-Video Priors,Rinon Gal · Yael Vinker · Yuval Alaluf · Amit H. Bermano · Daniel Cohen-Or · Ariel Shamir · Gal Chechik, ,https://arxiv.org/abs/2311.13608,,2311.13608.pdf,Breathing Life Into Sketches Using Text-to-Video Priors,"A sketch is one of the most intuitive and versatile tools humans use to +convey their ideas visually. An animated sketch opens another dimension to the +expression of ideas and is widely used by designers for a variety of purposes. +Animating sketches is a laborious process, requiring extensive experience and +professional design skills. In this work, we present a method that +automatically adds motion to a single-subject sketch (hence, ""breathing life +into it""), merely by providing a text prompt indicating the desired motion. 
The +output is a short animation provided in vector representation, which can be +easily edited. Our method does not require extensive training, but instead +leverages the motion prior of a large pretrained text-to-video diffusion model +using a score-distillation loss to guide the placement of strokes. To promote +natural and smooth motion and to better preserve the sketch's appearance, we +model the learned motion through two components. The first governs small local +deformations and the second controls global affine transformations. +Surprisingly, we find that even models that struggle to generate sketch videos +on their own can still serve as a useful backbone for animating abstract +representations.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +BrainWash: A Poisoning Attack to Forget in Continual Learning,Ali Abbasi · Parsa Nooralinejad · Hamed Pirsiavash · Soheil Kolouri, ,https://arxiv.org/abs/2311.11995,,2311.11995.pdf,BrainWash: A Poisoning Attack to Forget in Continual Learning,"Continual learning has gained substantial attention within the deep learning +community, offering promising solutions to the challenging problem of +sequential learning. Yet, a largely unexplored facet of this paradigm is its +susceptibility to adversarial attacks, especially with the aim of inducing +forgetting. In this paper, we introduce ""BrainWash,"" a novel data poisoning +method tailored to impose forgetting on a continual learner. By adding the +BrainWash noise to a variety of baselines, we demonstrate how a trained +continual learner can be induced to forget its previously learned tasks +catastrophically, even when using these continual learning baselines. An +important feature of our approach is that the attacker requires no access to +previous tasks' data and is armed merely with the model's current parameters +and the data belonging to the most recent task. Our extensive experiments +highlight the efficacy of BrainWash, showcasing degradation in performance +across various regularization-based continual learning methods.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis,Xiangjun Gao · Xiaoyu Li · Chaopeng Zhang · Qi Zhang · Yan-Pei Cao · Ying Shan · Long Quan, ,https://arxiv.org/abs/2311.17123,,2311.17123.pdf,ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis,"In this work, we propose a method to address the challenge of rendering a 3D +human from a single image in a free-view manner. Some existing approaches could +achieve this by using generalizable pixel-aligned implicit fields to +reconstruct a textured mesh of a human or by employing a 2D diffusion model as +guidance with the Score Distillation Sampling (SDS) method, to lift the 2D +image into 3D space. However, a generalizable implicit field often results in +an over-smooth texture field, while the SDS method tends to lead to a +texture-inconsistent novel view with the input image. In this paper, we +introduce a texture-consistent back view synthesis module that could transfer +the reference image content to the back view through depth and text-guided +attention injection. Moreover, to alleviate the color distortion that occurs in +the side region, we propose a visibility-aware patch consistency regularization +for texture mapping and refinement combined with the synthesized back view +texture. With the above techniques, we could achieve high-fidelity and +texture-consistent human rendering from a single image. 
Experiments conducted +on both real and synthetic data demonstrate the effectiveness of our method and +show that our approach outperforms previous baseline methods.",cs.CV,"['cs.CV', 'cs.AI']" +NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging,Takahiro Shirakawa · Seiichi Uchida, ,https://arxiv.org/abs/2403.03485,,2403.03485.pdf,NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging,"Layout-aware text-to-image generation is a task to generate multi-object +images that reflect layout conditions in addition to text conditions. The +current layout-aware text-to-image diffusion models still have several issues, +including mismatches between the text and layout conditions and quality +degradation of generated images. This paper proposes a novel layout-aware +text-to-image diffusion model called NoiseCollage to tackle these issues. +During the denoising process, NoiseCollage independently estimates noises for +individual objects and then crops and merges them into a single noise. This +operation helps avoid condition mismatches; in other words, it can put the +right objects in the right places. Qualitative and quantitative evaluations +show that NoiseCollage outperforms several state-of-the-art models. These +successful results indicate that the crop-and-merge operation of noises is a +reasonable strategy to control image generation. We also show that NoiseCollage +can be integrated with ControlNet to use edges, sketches, and pose skeletons as +additional conditions. Experimental results show that this integration boosts +the layout accuracy of ControlNet. The code is available at +https://github.com/univ-esuty/noisecollage.",cs.CV,['cs.CV'] +SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution,Zhixuan Liang · Yao Mu · Hengbo Ma · Masayoshi Tomizuka · Mingyu Ding · Ping Luo,https://skilldiffuser.github.io/,https://arxiv.org/abs/2312.11598,,2312.11598.pdf,SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution,"Diffusion models have demonstrated strong potential for robotic trajectory +planning. However, generating coherent trajectories from high-level +instructions remains challenging, especially for long-range composition tasks +requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end +hierarchical planning framework integrating interpretable skill learning with +conditional diffusion planning to address this problem. At the higher level, +the skill abstraction module learns discrete, human-understandable skill +representations from visual observations and language instructions. These +learned skill embeddings are then used to condition the diffusion model to +generate customized latent trajectories aligned with the skills. This allows +generating diverse state trajectories that adhere to the learnable skills. By +integrating skill learning with conditional trajectory generation, +SkillDiffuser produces coherent behavior following abstract instructions across +diverse tasks. Experiments on multi-task robotic manipulation benchmarks like +Meta-World and LOReL demonstrate state-of-the-art performance and +human-interpretable skill representations from SkillDiffuser. 
More +visualization results and information could be found on our website.",cs.RO,"['cs.RO', 'cs.CV', 'cs.LG']" +CurveCloudNet: Processing Point Clouds with 1D Structure,Colton Stearns · Alex Fu · Jiateng Liu · Jeong Joon Park · Davis Rempe · Despoina Paschalidou · Leonidas Guibas, ,https://arxiv.org/abs/2312.12743,,2312.12743.pdf,PointeNet: A Lightweight Framework for Effective and Efficient Point Cloud Analysis,"Current methodologies in point cloud analysis predominantly explore 3D +geometries, often achieved through the introduction of intricate learnable +geometric extractors in the encoder or by deepening networks with repeated +blocks. However, these approaches inevitably lead to a significant number of +learnable parameters, resulting in substantial computational costs and imposing +memory burdens on CPU/GPU. Additionally, the existing strategies are primarily +tailored for object-level point cloud classification and segmentation tasks, +with limited extensions to crucial scene-level applications, such as autonomous +driving. In response to these limitations, we introduce PointeNet, an efficient +network designed specifically for point cloud analysis. PointeNet distinguishes +itself with its lightweight architecture, low training cost, and plug-and-play +capability, effectively capturing representative features. The network consists +of a Multivariate Geometric Encoding (MGE) module and an optional +Distance-aware Semantic Enhancement (DSE) module. The MGE module employs +operations of sampling, grouping, and multivariate geometric aggregation to +lightweightly capture and adaptively aggregate multivariate geometric features, +providing a comprehensive depiction of 3D geometries. The DSE module, designed +for real-world autonomous driving scenarios, enhances the semantic perception +of point clouds, particularly for distant points. Our method demonstrates +flexibility by seamlessly integrating with a classification/segmentation head +or embedding into off-the-shelf 3D object detection networks, achieving notable +performance improvements at a minimal cost. Extensive experiments on +object-level datasets, including ModelNet40, ScanObjectNN, ShapeNetPart, and +the scene-level dataset KITTI, demonstrate the superior performance of +PointeNet over state-of-the-art methods in point cloud analysis.",cs.CV,['cs.CV'] +LAN: Learning to Adapt Noise for Image Denoising,Changjin Kim · Tae Hyun Kim · Sungyong Baik, ,https://arxiv.org/abs/2403.15132,,2403.15132.pdf,Transfer CLIP for Generalizable Image Denoising,"Image denoising is a fundamental task in computer vision. While prevailing +deep learning-based supervised and self-supervised methods have excelled in +eliminating in-distribution noise, their susceptibility to out-of-distribution +(OOD) noise remains a significant challenge. The recent emergence of +contrastive language-image pre-training (CLIP) model has showcased exceptional +capabilities in open-world image recognition and segmentation. Yet, the +potential for leveraging CLIP to enhance the robustness of low-level tasks +remains largely unexplored. This paper uncovers that certain dense features +extracted from the frozen ResNet image encoder of CLIP exhibit +distortion-invariant and content-related properties, which are highly desirable +for generalizable denoising. 
Leveraging these properties, we devise an +asymmetrical encoder-decoder denoising network, which incorporates dense +features including the noisy image and its multi-scale features from the frozen +ResNet encoder of CLIP into a learnable image decoder to achieve generalizable +denoising. The progressive feature augmentation strategy is further proposed to +mitigate feature overfitting and improve the robustness of the learnable +decoder. Extensive experiments and comparisons conducted across diverse OOD +noises, including synthetic noise, real-world sRGB noise, and low-dose CT image +noise, demonstrate the superior generalization ability of our method.",cs.CV,"['cs.CV', 'eess.IV']" +Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training,Shizhan Gong · Qi Dou · Farzan Farnia, ,https://arxiv.org/abs/2404.04647,,2404.04647.pdf,Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training,"Gradient-based saliency maps have been widely used to explain the decisions +of deep neural network classifiers. However, standard gradient-based +interpretation maps, including the simple gradient and integrated gradient +algorithms, often lack desired structures such as sparsity and connectedness in +their application to real-world computer vision models. A frequently used +approach to inducing sparsity structures into gradient-based saliency maps is +to alter the simple gradient scheme using sparsification or norm-based +regularization. A drawback with such post-processing methods is their +frequently-observed significant loss in fidelity to the original simple +gradient map. In this work, we propose to apply adversarial training as an +in-processing scheme to train neural networks with structured simple gradient +maps. We show a duality relation between the regularized norms of the +adversarial perturbations and gradient-based maps, based on which we design +adversarial training loss functions promoting sparsity and group-sparsity +properties in simple gradient maps. We present several numerical results to +show the influence of our proposed norm-based adversarial training methods on +the standard gradient-based maps of standard neural network architectures on +benchmark image datasets.",cs.CV,['cs.CV'] +MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading,Abdallah Dib · Luiz Gustavo Hafemann · Emeline Got · Trevor Anderson · Amin Fadaeinejad · Rafael M. O. Cruz · Marc-André Carbonneau, ,https://arxiv.org/abs/2312.13091v2,,2312.13091v2.pdf,MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading,"Reconstructing an avatar from a portrait image has many applications in +multimedia, but remains a challenging research problem. Extracting reflectance +maps and geometry from one image is ill-posed: recovering geometry is a +one-to-many mapping problem and reflectance and light are difficult to +disentangle. Accurate geometry and reflectance can be captured under the +controlled conditions of a light stage, but it is costly to acquire large +datasets in this fashion. Moreover, training solely with this type of data +leads to poor generalization with in-the-wild images. This motivates the +introduction of MoSAR, a method for 3D avatar generation from monocular images. +We propose a semi-supervised training scheme that improves generalization by +learning from both light stage and in-the-wild datasets. This is achieved using +a novel differentiable shading formulation. 
We show that our approach +effectively disentangles the intrinsic face parameters, producing relightable +avatars. As a result, MoSAR estimates a richer set of skin reflectance maps, +and generates more realistic avatars than existing state-of-the-art methods. We +also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public +dataset providing intrinsic face attributes at scale (diffuse, specular, +ambient occlusion and translucency maps) for a total of 10k subjects. The +project website and the dataset are available on the following link: +https://ubisoft-laforge.github.io/character/mosar/",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', '68T45 (Primary) 68T07, 68T01 (Secondary)', 'I.2.10; I.4; I.3.3; I.5']" +Cinematic Behavior Transfer via NeRF-based Differentiable Filming,Xuekun Jiang · Anyi Rao · Jingbo Wang · Dahua Lin · Bo Dai, ,https://arxiv.org/abs/2311.17754,,2311.17754.pdf,Cinematic Behavior Transfer via NeRF-based Differentiable Filming,"In the evolving landscape of digital media and video production, the precise +manipulation and reproduction of visual elements like camera movements and +character actions are highly desired. Existing SLAM methods face limitations in +dynamic scenes and human pose estimation often focuses on 2D projections, +neglecting 3D statuses. To address these issues, we first introduce a reverse +filming behavior estimation technique. It optimizes camera trajectories by +leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then +introduce a cinematic transfer pipeline that is able to transfer various shot +types to a new 2D video or a 3D virtual environment. The incorporation of 3D +engine workflow enables superior rendering and control abilities, which also +achieves a higher rating in the user study.",cs.CV,"['cs.CV', 'cs.GR', 'cs.HC', 'cs.MM']" +Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation,Xiaohan Lei · Min Wang · Wengang Zhou · Li Li · Houqiang Li,https://xiaohanlei.github.io/projects/IEVE/,https://arxiv.org/abs/2402.17587,,2402.17587.pdf,Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation,"As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to +navigate to a specified object depicted by a goal image in an unexplored +environment. + The main challenge of this task lies in identifying the target object from +different viewpoints while rejecting similar distractors. + Existing ImageGoal Navigation methods usually adopt the simple +Exploration-Exploitation framework and ignore the identification of specific +instance during navigation. + In this work, we propose to imitate the human behaviour of ``getting closer +to confirm"" when distinguishing objects from a distance. + Specifically, we design a new modular navigation framework named +Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level +image goal navigation. + Our method allows for active switching among the exploration, verification, +and exploitation actions, thereby facilitating the agent in making reasonable +decisions under different situations. + On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our +method surpasses previous state-of-the-art work, with a classical segmentation +model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 
0.561 success)",cs.CV,"['cs.CV', 'cs.RO']" +TextNeRF: A Novel Scene-Text Image Synthesis Method based on Neural Radiance Fields,Jialei Cui · Jianwei Du · Wenzhuo Liu · Zhouhui Lian, ,https://arxiv.org/abs/2403.01325,,2403.01325.pdf,NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning,"Neural Radiance Fields (NeRF) have garnered remarkable success in novel view +synthesis. Nonetheless, the task of generating high-quality images for novel +views persists as a critical challenge. While the existing efforts have +exhibited commendable progress, capturing intricate details, enhancing +textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics +warrant further focused attention and advancement. In this work, we propose +NeRF-VPT, an innovative method for novel view synthesis to address these +challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning +paradigm, wherein RGB information gained from preceding rendering outcomes +serves as instructive visual prompts for subsequent rendering stages, with the +aspiration that the prior knowledge embedded in the prompts can facilitate the +gradual enhancement of rendered image quality. NeRF-VPT only requires sampling +RGB data from previous stage renderings as priors at each training stage, +without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is +plug-and-play and can be readily integrated into existing methods. By +conducting comparative analyses of our NeRF-VPT against several NeRF-based +approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, +Real Forward-Facing, Replica dataset, and a user-captured dataset, we +substantiate that our NeRF-VPT significantly elevates baseline performance and +proficiently generates more high-quality novel view images than all the +compared state-of-the-art methods. Furthermore, the cascading learning of +NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in +a significant enhancement of accuracy for sparse-view novel view synthesis. The +source code and dataset are available at +\url{https://github.com/Freedomcls/NeRF-VPT}.",cs.CV,['cs.CV'] +Sparse Global Matching for Video Frame Interpolation with Large Motion,Chunxu Liu · Guozhen Zhang · Rui Zhao · Limin Wang, ,https://arxiv.org/abs/2404.06913,,2404.06913.pdf,Sparse Global Matching for Video Frame Interpolation with Large Motion,"Large motion poses a critical challenge in Video Frame Interpolation (VFI) +task. Existing methods are often constrained by limited receptive fields, +resulting in sub-optimal performance when handling scenarios with large motion. +In this paper, we introduce a new pipeline for VFI, which can effectively +integrate global-level information to alleviate issues associated with large +motion. Specifically, we first estimate a pair of initial intermediate flows +using a high-resolution feature map for extracting local details. Then, we +incorporate a sparse global matching branch to compensate for flow estimation, +which consists of identifying flaws in initial flows and generating sparse flow +compensation with a global receptive field. Finally, we adaptively merge the +initial flow estimation with global flow compensation, yielding a more accurate +intermediate flow. To evaluate the effectiveness of our method in handling +large motion, we carefully curate a more challenging subset from commonly used +benchmarks. 
Our method demonstrates the state-of-the-art performance on these +VFI subsets with large motion.",cs.CV,['cs.CV'] +StraightPCF: Straight Point Cloud Filtering,Dasith de Silva Edirimuni · Xuequan Lu · Gang Li · Lei Wei · Antonio Robles-Kelly · Hongdong Li,https://ddsediri.github.io/ projects/StraightPCF/,https://arxiv.org/abs/2405.08322,,2405.08322.pdf,StraightPCF: Straight Point Cloud Filtering,"Point cloud filtering is a fundamental 3D vision task, which aims to remove +noise while recovering the underlying clean surfaces. State-of-the-art methods +remove noise by moving noisy points along stochastic trajectories to the clean +surfaces. These methods often require regularization within the training +objective and/or during post-processing, to ensure fidelity. In this paper, we +introduce StraightPCF, a new deep learning based method for point cloud +filtering. It works by moving noisy points along straight paths, thus reducing +discretization errors while ensuring faster convergence to the clean surfaces. +We model noisy patches as intermediate states between high noise patch variants +and their clean counterparts, and design the VelocityModule to infer a constant +flow velocity from the former to the latter. This constant flow leads to +straight filtering trajectories. In addition, we introduce a DistanceModule +that scales the straight trajectory using an estimated distance scalar to +attain convergence near the clean surface. Our network is lightweight and only +has $\sim530K$ parameters, being 17% of IterativePFN (a most recent point cloud +filtering network). Extensive experiments on both synthetic and real-world data +show our method achieves state-of-the-art results. Our method also demonstrates +nice distributions of filtered points without the need for regularization. The +implementation code can be found at: https://github.com/ddsediri/StraightPCF.",cs.CV,['cs.CV'] +MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection,Jakub Micorek · Horst Possegger · Dominik Narnhofer · Horst Bischof · Mateusz Kozinski,https://github.com/jakubmicorek/MULDE-Multiscale-Log-Density-Estimation-via-Denoising-Score-Matching-for-Video-Anomaly-Detection,https://arxiv.org/abs/2403.14497,,2403.14497.pdf,MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection,"We propose a novel approach to video anomaly detection: we treat feature +vectors extracted from videos as realizations of a random variable with a fixed +distribution and model this distribution with a neural network. This lets us +estimate the likelihood of test videos and detect video anomalies by +thresholding the likelihood estimates. We train our video anomaly detector +using a modification of denoising score matching, a method that injects +training data with noise to facilitate modeling its distribution. To eliminate +hyperparameter selection, we model the distribution of noisy video features +across a range of noise levels and introduce a regularizer that tends to align +the models for different levels of noise. At test time, we combine anomaly +indications at multiple noise scales with a Gaussian mixture model. Running our +video anomaly detector induces minimal delays as inference requires merely +extracting the features and forward-propagating them through a shallow neural +network and a Gaussian mixture model. 
Our experiments on five popular video +anomaly detection benchmarks demonstrate state-of-the-art performance, both in +the object-centric and in the frame-centric setup.",cs.CV,['cs.CV'] +Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models,Peifei Zhu · Tsubasa Takahashi · Hirokatsu Kataoka, ,https://arxiv.org/abs/2404.09401,,2404.09401.pdf,Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models,"Diffusion Models (DMs) have shown remarkable capabilities in various +image-generation tasks. However, there are growing concerns that DMs could be +used to imitate unauthorized creations and thus raise copyright issues. To +address this issue, we propose a novel framework that embeds personal +watermarks in the generation of adversarial examples. Such examples can force +DMs to generate images with visible watermarks and prevent DMs from imitating +unauthorized images. We construct a generator based on conditional adversarial +networks and design three losses (adversarial loss, GAN loss, and perturbation +loss) to generate adversarial examples that have subtle perturbation but can +effectively attack DMs to prevent copyright violations. Training a generator +for a personal watermark by our method only requires 5-10 samples within 2-3 +minutes, and once the generator is trained, it can generate adversarial +examples with that watermark significantly fast (0.2s per image). We conduct +extensive experiments in various conditional image-generation scenarios. +Compared to existing methods that generate images with chaotic textures, our +method adds visible watermarks on the generated images, which is a more +straightforward way to indicate copyright violations. We also observe that our +adversarial examples exhibit good transferability across unknown generative +models. Therefore, this work provides a simple yet powerful way to protect +copyright from DM-based imitation.",cs.CV,"['cs.CV', 'cs.AI']" +Dr. Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering,Yichen Sheng · Zixun Yu · Lu Ling · Zhiwen Cao · Xuaner Zhang · Xin Lu · Ke Xian · Haiting Lin · Bedrich Benes, ,https://arxiv.org/abs/2308.08843,,2308.08843.pdf,Dr.Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering,"Bokeh is widely used in photography to draw attention to the subject while +effectively isolating distractions in the background. Computational methods +simulate bokeh effects without relying on a physical camera lens. However, in +the realm of digital bokeh synthesis, the two main challenges for bokeh +synthesis are color bleeding and partial occlusion at object boundaries. Our +primary goal is to overcome these two major challenges using physics principles +that define bokeh formation. To achieve this, we propose a novel and accurate +filtering-based bokeh rendering equation and a physically-based occlusion-aware +bokeh renderer, dubbed Dr.Bokeh, which addresses the aforementioned challenges +during the rendering stage without the need of post-processing or data-driven +approaches. Our rendering algorithm first preprocesses the input RGBD to obtain +a layered scene representation. Dr.Bokeh then takes the layered representation +and user-defined lens parameters to render photo-realistic lens blur. By +softening non-differentiable operations, we make Dr.Bokeh differentiable such +that it can be plugged into a machine-learning framework. 
We perform +quantitative and qualitative evaluations on synthetic and real-world images to +validate the effectiveness of the rendering quality and the differentiability +of our method. We show Dr.Bokeh not only outperforms state-of-the-art bokeh +rendering algorithms in terms of photo-realism but also improves the depth +quality from depth-from-defocus.",cs.GR,['cs.GR'] +XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold,Guangyu Wang · Jinzhi Zhang · Fan Wang · Ruqi Huang · Lu Fang, ,https://arxiv.org/abs/2403.19517,,2403.19517.pdf,XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold,"We propose XScale-NVS for high-fidelity cross-scale novel view synthesis of +real-world large-scale scenes. Existing representations based on explicit +surface suffer from discretization resolution or UV distortion, while implicit +volumetric representations lack scalability for large scenes due to the +dispersed weight distribution and surface ambiguity. In light of the above +challenges, we introduce hash featurized manifold, a novel hash-based +featurization coupled with a deferred neural rendering framework. This approach +fully unlocks the expressivity of the representation by explicitly +concentrating the hash entries on the 2D manifold, thus effectively +representing highly detailed contents independent of the discretization +resolution. We also introduce a novel dataset, namely GigaNVS, to benchmark +cross-scale, high-resolution novel view synthesis of realworld large-scale +scenes. Our method significantly outperforms competing baselines on various +real-world scenes, yielding an average LPIPS that is 40% lower than prior +state-of-the-art on the challenging GigaNVS benchmark. Please see our project +page at: xscalenvs.github.io.",cs.CV,['cs.CV'] +Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions,Oindrila Saha · Grant Horn · Subhransu Maji,https://github.com/cvl-umass/AdaptCLIPZS/,https://arxiv.org/abs/2401.02460,,2401.02460.pdf,Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions,"The zero-shot performance of existing vision-language models (VLMs) such as +CLIP is limited by the availability of large-scale, aligned image and text +datasets in specific domains. In this work, we leverage two complementary +sources of information -- descriptions of categories generated by large +language models (LLMs) and abundant, fine-grained image classification datasets +-- to improve the zero-shot classification performance of VLMs across +fine-grained domains. On the technical side, we develop methods to train VLMs +with this ""bag-level"" image-text supervision. We find that simply using these +attributes at test-time does not improve performance, but our training +strategy, for example, on the iNaturalist dataset, leads to an average +improvement of 4-5% in zero-shot classification accuracy for novel categories +of birds and flowers. Similar improvements are observed in domains where a +subset of the categories was used to fine-tune the model. By prompting LLMs in +various ways, we generate descriptions that capture visual appearance, habitat, +and geographic regions and pair them with existing attributes such as the +taxonomic structure of the categories. We systematically evaluate their ability +to improve zero-shot categorization in natural domains. Our findings suggest +that geographic priors can be just as effective and are complementary to visual +appearance. 
Our method also outperforms prior work on prompt-based tuning of +VLMs. We release the benchmark, consisting of 14 datasets at +https://github.com/cvl-umass/AdaptCLIPZS , which will contribute to future +research in zero-shot recognition.",cs.CV,['cs.CV'] +Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking,Cheng-Yao Hong · Yen-Chi Hsu · Tyng-Luh Liu, ,https://arxiv.org/abs/2401.01448,,2401.01448.pdf,ProbMCL: Simple Probabilistic Contrastive Learning for Multi-label Visual Classification,"Multi-label image classification presents a challenging task in many domains, +including computer vision and medical imaging. Recent advancements have +introduced graph-based and transformer-based methods to improve performance and +capture label dependencies. However, these methods often include complex +modules that entail heavy computation and lack interpretability. In this paper, +we propose Probabilistic Multi-label Contrastive Learning (ProbMCL), a novel +framework to address these challenges in multi-label image classification +tasks. Our simple yet effective approach employs supervised contrastive +learning, in which samples that share enough labels with an anchor image based +on a decision threshold are introduced as a positive set. This structure +captures label dependencies by pulling positive pair embeddings together and +pushing away negative samples that fall below the threshold. We enhance +representation learning by incorporating a mixture density network into +contrastive learning and generating Gaussian mixture distributions to explore +the epistemic uncertainty of the feature encoder. We validate the effectiveness +of our framework through experimentation with datasets from the computer vision +and medical imaging domains. Our method outperforms the existing +state-of-the-art methods while achieving a low computational footprint on both +datasets. Visualization analyses also demonstrate that ProbMCL-learned +classifiers maintain a meaningful semantic topology.",cs.CV,"['cs.CV', 'cs.LG']" +"Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline",Xiaoqi Zhao · Youwei Pang · Zhenyu Chen · Qian Yu · Lihe Zhang · Hanqi Liu · Jiaming Zuo · Huchuan Lu, ,https://arxiv.org/abs/2312.02528,,2312.02528.pdf,"Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline","We conduct a comprehensive study on a new task named power battery detection +(PBD), which aims to localize the dense cathode and anode plates endpoints from +X-ray images to evaluate the quality of power batteries. Existing manufacturers +usually rely on human eye observation to complete PBD, which makes it difficult +to balance the accuracy and efficiency of detection. To address this issue and +drive more attention into this meaningful task, we first elaborately collect a +dataset, called X-ray PBD, which has $1,500$ diverse X-ray images selected from +thousands of power batteries of $5$ manufacturers, with $7$ different visual +interference. Then, we propose a novel segmentation-based solution for PBD, +termed multi-dimensional collaborative network (MDCNet). With the help of line +and counting predictors, the representation of the point segmentation branch +can be improved at both semantic and detail aspects.Besides, we design an +effective distance-adaptive mask generation strategy, which can alleviate the +visual challenge caused by the inconsistent distribution density of plates to +provide MDCNet with stable supervision. 
Without any bells and whistles, our +segmentation-based MDCNet consistently outperforms various other corner +detection, crowd counting and general/tiny object detection-based solutions, +making it a strong baseline that can help facilitate future research in PBD. +Finally, we share some potential difficulties and works for future researches. +The source code and datasets will be publicly available at +\href{https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD}{X-ray PBD}.",cs.CV,['cs.CV'] +SPU-PMD: Self-Supervised Point Cloud Upsampling via Progressive Mesh Deformation,Yanzhe Liu · Rong Chen · Yushi Li · Yixi Li · Xuehou Tan, ,,https://dl.acm.org/doi/10.1109/TPAMI.2023.3287628,,,,,nan +Bidirectional Autoregessive Diffusion Model for Dance Generation,Canyu Zhang · Youbao Tang · NING Zhang · Ruei-Sung Lin · Mei Han · Jing Xiao · Song Wang, ,https://arxiv.org/abs/2402.04356,,2402.04356.pdf,Bidirectional Autoregressive Diffusion Model for Dance Generation,"Dance serves as a powerful medium for expressing human emotions, but the +lifelike generation of dance is still a considerable challenge. Recently, +diffusion models have showcased remarkable generative abilities across various +domains. They hold promise for human motion generation due to their adaptable +many-to-many nature. Nonetheless, current diffusion-based motion generation +models often create entire motion sequences directly and unidirectionally, +lacking focus on the motion with local and bidirectional enhancement. When +choreographing high-quality dance movements, people need to take into account +not only the musical context but also the nearby music-aligned dance motions. +To authentically capture human behavior, we propose a Bidirectional +Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a +bidirectional encoder is built to enforce that the generated dance is +harmonious in both the forward and backward directions. To make the generated +dance motion smoother, a local information decoder is built for local motion +enhancement. The proposed framework is able to generate new motions based on +the input conditions and nearby motions, which foresees individual motion +slices iteratively and consolidates all predictions. To further refine the +synchronicity between the generated dance and the beat, the beat information is +incorporated as an input to generate better music-aligned dance movements. +Experimental results demonstrate that the proposed model achieves +state-of-the-art performance compared to existing unidirectional approaches on +the prominent benchmark for music-to-dance generation.",cs.SD,"['cs.SD', 'cs.CV', 'eess.AS']" +Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair,Jeonghoon Park · Chaeyeon Chung · Jaegul Choo, ,https://arxiv.org/abs/2404.19250,,2404.19250.pdf,Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair,"In the image classification task, deep neural networks frequently rely on +bias attributes that are spuriously correlated with a target class in the +presence of dataset bias, resulting in degraded performance when applied to +data without bias attributes. The task of debiasing aims to compel classifiers +to learn intrinsic attributes that inherently define a target class rather than +focusing on bias attributes. 
While recent approaches mainly focus on +emphasizing the learning of data samples without bias attributes (i.e., +bias-conflicting samples) compared to samples with bias attributes (i.e., +bias-aligned samples), they fall short of directly guiding models where to +focus for learning intrinsic features. To address this limitation, this paper +proposes a method that provides the model with explicit spatial guidance that +indicates the region of intrinsic features. We first identify the intrinsic +features by investigating the class-discerning common features between a +bias-aligned (BA) sample and a bias-conflicting (BC) sample (i.e., +bias-contrastive pair). Next, we enhance the intrinsic features in the BA +sample that are relatively under-exploited for prediction compared to the BC +sample. To construct the bias-contrastive pair without using bias information, +we introduce a bias-negative score that distinguishes BC samples from BA +samples employing a biased model. The experiments demonstrate that our method +achieves state-of-the-art performance on synthetic and real-world datasets with +various levels of bias severity.",cs.CV,['cs.CV'] +ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation,Jia-Hao Wu · Fu-Jen Tsai · Yan-Tsung Peng · Charles Tsai · Chia-Wen Lin · Yen-Yu Lin,https://github.com/plusgood-steven/ID-Blau,https://arxiv.org/abs/2312.10998v1,,2312.10998v1.pdf,ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation,"Image deblurring aims to remove undesired blurs from an image captured in a +dynamic scene. Much research has been dedicated to improving deblurring +performance through model architectural designs. However, there is little work +on data augmentation for image deblurring. Since continuous motion causes +blurred artifacts during image exposure, we aspire to develop a groundbreaking +blur augmentation method to generate diverse blurred images by simulating +motion trajectories in a continuous space. This paper proposes Implicit +Diffusion-based reBLurring AUgmentation (ID-Blau), utilizing a sharp image +paired with a controllable blur condition map to produce a corresponding +blurred image. We parameterize the blur patterns of a blurred image with their +orientations and magnitudes as a pixel-wise blur condition map to simulate +motion trajectories and implicitly represent them in a continuous space. By +sampling diverse blur conditions, ID-Blau can generate various blurred images +unseen in the training set. Experimental results demonstrate that ID-Blau can +produce realistic blurred images for training and thus significantly improve +performance for state-of-the-art deblurring models.",cs.CV,['cs.CV'] +SLICE: Stabilized LIME for Consistent Explanations for Image Classification,Revoti Prasad Bora · Kiran Raja · Philipp Terhörst · Raymond Veldhuis · Raghavendra Ramachandra, ,https://arxiv.org/abs/2403.17742,,2403.17742.pdf,Using Stratified Sampling to Improve LIME Image Explanations,"We investigate the use of a stratified sampling approach for LIME Image, a +popular model-agnostic explainable AI method for computer vision tasks, in +order to reduce the artifacts generated by typical Monte Carlo sampling. Such +artifacts are due to the undersampling of the dependent variable in the +synthetic neighborhood around the image being explained, which may result in +inadequate explanations due to the impossibility of fitting a linear regressor +on the sampled data. 
We then highlight a connection with the Shapley theory, +where similar arguments about undersampling and sample relevance were suggested +in the past. We derive all the formulas and adjustment factors required for an +unbiased stratified sampling estimator. Experiments show the efficacy of the +proposed approach.",cs.AI,['cs.AI'] +Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning,Joshua C. Zhao · Ahaan Dabholkar · Atul Sharma · Saurabh Bagchi, ,https://arxiv.org/abs/2403.18144,,2403.18144.pdf,Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning,"Federated learning is a decentralized learning paradigm introduced to +preserve privacy of client data. Despite this, prior work has shown that an +attacker at the server can still reconstruct the private training data using +only the client updates. These attacks are known as data reconstruction attacks +and fall into two major categories: gradient inversion (GI) and linear layer +leakage attacks (LLL). However, despite demonstrating the effectiveness of +these attacks in breaching privacy, prior work has not investigated the +usefulness of the reconstructed data for downstream tasks. In this work, we +explore data reconstruction attacks through the lens of training and improving +models with leaked data. We demonstrate the effectiveness of both GI and LLL +attacks in maliciously training models using the leaked data more accurately +than a benign federated learning strategy. Counter-intuitively, this bump in +training quality can occur despite limited reconstruction quality or a small +total number of leaked images. Finally, we show the limitations of these +attacks for downstream training, individually for GI attacks and for LLL +attacks.",cs.CR,"['cs.CR', 'cs.CV']" +Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning,Nirat Saini · Khoi Pham · Abhinav Shrivastava, ,https://arxiv.org/html/2403.05924v1,,2403.05924v1.pdf,CSCNET: Class-Specified Cascaded Network for Compositional Zero-Shot Learning,"Attribute and object (A-O) disentanglement is a fundamental and critical +problem for Compositional Zero-shot Learning (CZSL), whose aim is to recognize +novel A-O compositions based on foregone knowledge. Existing methods based on +disentangled representation learning lose sight of the contextual dependency +between the A-O primitive pairs. Inspired by this, we propose a novel A-O +disentangled framework for CZSL, namely Class-specified Cascaded Network +(CSCNet). The key insight is to firstly classify one primitive and then +specifies the predicted class as a priori for guiding another primitive +recognition in a cascaded fashion. To this end, CSCNet constructs +Attribute-to-Object and Object-to-Attribute cascaded branches, in addition to a +composition branch modeling the two primitives as a whole. Notably, we devise a +parametric classifier (ParamCls) to improve the matching between visual and +semantic embeddings. 
By improving the A-O disentanglement, our framework +achieves superior results than previous competitive methods.",cs.CV,['cs.CV'] +Improving Graph Contrastive Learning via Adaptive Positive Sampling,Jiaming Zhuo · Feiyang Qin · Can Cui · Kun Fu · Bingxin Niu · Mengzhu Wang · Yuanfang Guo · Chuan Wang · Zhen Wang · Xiaochun Cao · Liang Yang, ,,https://ieeexplore.ieee.org/document/10181235,,,,,nan +ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models,Lukas Höllein · Aljaž Božič · Norman Müller · David Novotny · Hung-Yu Tseng · Christian Richardt · Michael Zollhoefer · Matthias Nießner,https://lukashoel.github.io/ViewDiff/,https://arxiv.org/abs/2403.01807,,2403.01807.pdf,ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models,"3D asset generation is getting massive amounts of attention, inspired by the +recent success of text-guided 2D content creation. Existing text-to-3D methods +use pretrained text-to-image diffusion models in an optimization problem or +fine-tune them on synthetic data, which often results in non-photorealistic 3D +objects without backgrounds. In this paper, we present a method that leverages +pretrained text-to-image models as a prior, and learn to generate multi-view +images in a single denoising process from real-world data. Concretely, we +propose to integrate 3D volume-rendering and cross-frame-attention layers into +each block of the existing U-Net network of the text-to-image model. Moreover, +we design an autoregressive generation that renders more 3D-consistent images +at any viewpoint. We train our model on real-world datasets of objects and +showcase its capabilities to generate instances with a variety of high-quality +shapes and textures in authentic surroundings. Compared to the existing +methods, the results generated by our method are consistent, and have favorable +visual quality (-30% FID, -37% KID).",cs.CV,['cs.CV'] +FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation,Zijia Lu · Ehsan Elhamifar, ,https://arxiv.org/abs/2308.14900,,2308.14900.pdf,BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation,"We address the task of supervised action segmentation which aims to partition +a video into non-overlapping segments, each representing a different action. +Recent works apply transformers to perform temporal modeling at the +frame-level, which suffer from high computational cost and cannot well capture +action dependencies over long temporal horizons. To address these issues, we +propose an efficient BI-level Temporal modeling (BIT) framework that learns +explicit action tokens to represent action segments, in parallel performs +temporal modeling on frame and action levels, while maintaining a low +computational cost. Our model contains (i) a frame branch that uses convolution +to learn frame-level relationships, (ii) an action branch that uses transformer +to learn action-level dependencies with a small set of action tokens and (iii) +cross-attentions to allow communication between the two branches. We apply and +extend a set-prediction objective to allow each action token to represent one +or multiple action segments, thus can avoid learning a large number of tokens +over long videos with many segments. Thanks to the design of our action branch, +we can also seamlessly leverage textual transcripts of videos (when available) +to help action segmentation by using them to initialize the action tokens. 
We +evaluate our model on four video datasets (two egocentric and two third-person) +for action segmentation with and without transcripts, showing that BIT +significantly improves the state-of-the-art accuracy with much lower +computational cost (30 times faster) compared to existing transformer-based +methods.",cs.CV,['cs.CV'] +Misalignment-Robust Frequency Distribution Loss for Image Transformation,Zhangkai Ni · Juncheng Wu · Zian Wang · Wenhan Yang · Hanli Wang · Lin Ma, ,https://arxiv.org/html/2402.18192v1,,2402.18192v1.pdf,Misalignment-Robust Frequency Distribution Loss for Image Transformation,"This paper aims to address a common challenge in deep learning-based image +transformation methods, such as image enhancement and super-resolution, which +heavily rely on precisely aligned paired datasets with pixel-level alignments. +However, creating precisely aligned paired images presents significant +challenges and hinders the advancement of methods trained on such data. To +overcome this challenge, this paper introduces a novel and simple Frequency +Distribution Loss (FDL) for computing distribution distance within the +frequency domain. Specifically, we transform image features into the frequency +domain using Discrete Fourier Transformation (DFT). Subsequently, frequency +components (amplitude and phase) are processed separately to form the FDL loss +function. Our method is empirically proven effective as a training constraint +due to the thoughtful utilization of global information in the frequency +domain. Extensive experimental evaluations, focusing on image enhancement and +super-resolution tasks, demonstrate that FDL outperforms existing +misalignment-robust loss functions. Furthermore, we explore the potential of +our FDL for image style transfer that relies solely on completely misaligned +data. Our code is available at: https://github.com/eezkni/FDL",cs.CV,"['cs.CV', 'eess.IV']" +DIEM: Decomposition-Integration Enhancing Multimodal Insights,Xinyi Jiang · Guoming Wang · Junhao Guo · Juncheng Li · Wenqiao Zhang · Rongxing Lu · Siliang Tang, ,,https://ieeexplore.ieee.org/document/10423001,,,,,nan +Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction,Inhwan Bae · Junoh Lee · Hae-Gon Jeon,https://github.com/InhwanBae/LMTrajectory,https://arxiv.org/abs/2403.18447,,2403.18447.pdf,Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction,"Language models have demonstrated impressive ability in context understanding +and generative performance. Inspired by the recent success of language +foundation models, in this paper, we propose LMTraj (Language-based Multimodal +Trajectory predictor), which recasts the trajectory prediction task into a sort +of question-answering problem. Departing from traditional numerical regression +models, which treat the trajectory coordinate sequence as continuous signals, +we consider them as discrete signals like text prompts. Specially, we first +transform an input space for the trajectory coordinate into the natural +language space. Here, the entire time-series trajectories of pedestrians are +converted into a text prompt, and scene images are described as text +information through image captioning. The transformed numerical and image data +are then wrapped into the question-answering template for use in a language +model. 
Next, to guide the language model in understanding and reasoning +high-level knowledge, such as scene context and social relationships between +pedestrians, we introduce an auxiliary multi-task question and answering. We +then train a numerical tokenizer with the prompt data. We encourage the +tokenizer to separate the integer and decimal parts well, and leverage it to +capture correlations between the consecutive numbers in the language model. +Lastly, we train the language model using the numerical tokenizer and all of +the question-answer prompts. Here, we propose a beam-search-based most-likely +prediction and a temperature-based multimodal prediction to implement both +deterministic and stochastic inferences. Applying our LMTraj, we show that the +language-based model can be a powerful pedestrian trajectory predictor, and +outperforms existing numerical-based predictor methods. Code is publicly +available at https://github.com/inhwanbae/LMTrajectory .",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG', 'cs.RO']" +Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors,Yu Zhang · Songpengcheng Xia · Lei Chu · Jiarui Yang · Qi Wu · Ling Pei, ,https://arxiv.org/abs/2312.02196,,2312.02196.pdf,Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors,"This paper introduces a novel human pose estimation approach using sparse +inertial sensors, addressing the shortcomings of previous methods reliant on +synthetic data. It leverages a diverse array of real inertial motion capture +data from different skeleton formats to improve motion diversity and model +generalization. This method features two innovative components: a +pseudo-velocity regression model for dynamic motion capture with inertial +sensors, and a part-based model dividing the body and sensor data into three +regions, each focusing on their unique characteristics. The approach +demonstrates superior performance over state-of-the-art models across five +public datasets, notably reducing pose error by 19\% on the DIP-IMU dataset, +thus representing a significant improvement in inertial sensor-based human pose +estimation. Our codes are available at {\url{https://github.com/dx118/dynaip}}.",cs.CV,['cs.CV'] +Learning to Remove Wrinkled Transparent Film with Polarized Prior,Jiaqi Tang · RUIZHENG WU · Xiaogang Xu · Sixing Hu · Ying-Cong Chen,https://jqt.me/_FilmRemoval_/,https://arxiv.org/abs/2403.04368v1,,2403.04368v1.pdf,Learning to Remove Wrinkled Transparent Film with Polarized Prior,"In this paper, we study a new problem, Film Removal (FR), which attempts to +remove the interference of wrinkled transparent films and reconstruct the +original information under films for industrial recognition systems. We first +physically model the imaging of industrial materials covered by the film. +Considering the specular highlight from the film can be effectively recorded by +the polarized camera, we build a practical dataset with polarization +information containing paired data with and without transparent film. We aim to +remove interference from the film (specular highlights and other degradations) +with an end-to-end framework. To locate the specular highlight, we use an angle +estimation network to optimize the polarization angle with the minimized +specular highlight. The image with minimized specular highlight is set as a +prior for supporting the reconstruction network. 
Based on the prior and the +polarized images, the reconstruction network can decouple all degradations from +the film. Extensive experiments show that our framework achieves SOTA +performance in both image reconstruction and industrial downstream tasks. Our +code will be released at \url{https://github.com/jqtangust/FilmRemoval}.",cs.CV,['cs.CV'] +FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning,Qiwei Li · Yuxin Peng · Jiahuan Zhou, ,https://arxiv.org/abs/2312.12722,,2312.12722.pdf,Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class Incremental Learning,"Non-exemplar class incremental learning aims to learn both the new and old +tasks without accessing any training data from the past. This strict +restriction enlarges the difficulty of alleviating catastrophic forgetting +since all techniques can only be applied to current task data. Considering this +challenge, we propose a novel framework of fine-grained knowledge selection and +restoration. The conventional knowledge distillation-based methods place too +strict constraints on the network parameters and features to prevent +forgetting, which limits the training of new tasks. To loose this constraint, +we proposed a novel fine-grained selective patch-level distillation to +adaptively balance plasticity and stability. Some task-agnostic patches can be +used to preserve the decision boundary of the old task. While some patches +containing the important foreground are favorable for learning the new task. + Moreover, we employ a task-agnostic mechanism to generate more realistic +prototypes of old tasks with the current task sample for reducing classifier +bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, +TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. +Code is available at https://github.com/scok30/vit-cil.",cs.CV,['cs.CV'] +Video-Based Human Pose Regression via Decoupled Space-Time Aggregation,Jijie He · Wenwu Yang,https://github.com/zgspose/DSTA,https://arxiv.org/abs/2403.19926,,2403.19926.pdf,Video-Based Human Pose Regression via Decoupled Space-Time Aggregation,"By leveraging temporal dependency in video sequences, multi-frame human pose +estimation algorithms have demonstrated remarkable results in complicated +situations, such as occlusion, motion blur, and video defocus. These algorithms +are predominantly based on heatmaps, resulting in high computation and storage +requirements per frame, which limits their flexibility and real-time +application in video scenarios, particularly on edge devices. In this paper, we +develop an efficient and effective video-based human pose regression method, +which bypasses intermediate representations such as heatmaps and instead +directly maps the input to the output joint coordinates. Despite the inherent +spatial correlation among adjacent joints of the human pose, the temporal +trajectory of each individual joint exhibits relative independence. In light of +this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to +separately capture the spatial contexts between adjacent joints and the +temporal cues of each individual joint, thereby avoiding the conflation of +spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token +for each joint to facilitate the modeling of their spatiotemporal dependencies. 
+With the proposed joint-wise local-awareness attention mechanism, our method is +capable of efficiently and flexibly utilizing the spatial dependency of +adjacent joints and the temporal dependency of each joint itself. Extensive +experiments demonstrate the superiority of our method. Compared to previous +regression-based single-frame human pose estimation methods, DSTA significantly +enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. +Furthermore, our approach either surpasses or is on par with the +state-of-the-art heatmap-based multi-frame human pose estimation methods. +Project page: https://github.com/zgspose/DSTA.",cs.CV,"['cs.CV', 'I.4.9']" +$L_0$-Sampler: An $L_{0}$ Model Guided Volume Sampling for NeRF,Liangchen Li · Juyong Zhang, ,https://arxiv.org/abs/2311.07044,,2311.07044.pdf,$L_0$-Sampler: An $L_{0}$ Model Guided Volume Sampling for NeRF,"Since being proposed, Neural Radiance Fields (NeRF) have achieved great +success in related tasks, mainly adopting the hierarchical volume sampling +(HVS) strategy for volume rendering. However, the HVS of NeRF approximates +distributions using piecewise constant functions, which provides a relatively +rough estimation. Based on the observation that a well-trained weight function +$w(t)$ and the $L_0$ distance between points and the surface have very high +similarity, we propose $L_0$-Sampler by incorporating the $L_0$ model into +$w(t)$ to guide the sampling process. Specifically, we propose to use piecewise +exponential functions rather than piecewise constant functions for +interpolation, which can not only approximate quasi-$L_0$ weight distributions +along rays quite well but also can be easily implemented with few lines of code +without additional computational burden. Stable performance improvements can be +achieved by applying $L_0$-Sampler to NeRF and its related tasks like 3D +reconstruction. Code is available at https://ustc3dv.github.io/L0-Sampler/ .",cs.CV,"['cs.CV', 'cs.GR']" +3DInAction: Understanding Human Actions in 3D Point Clouds,Yizhak Ben-Shabat · Oren Shrout · Stephen Gould, ,https://arxiv.org/html/2303.06346v2,,2303.06346v2.pdf,3DInAction: Understanding Human Actions in 3D Point Clouds,"We propose a novel method for 3D point cloud action recognition. +Understanding human actions in RGB videos has been widely studied in recent +years, however, its 3D point cloud counterpart remains under-explored. This is +mostly due to the inherent limitation of the point cloud data modality -- lack +of structure, permutation invariance, and varying number of points -- which +makes it difficult to learn a spatio-temporal representation. To address this +limitation, we propose the 3DinAction pipeline that first estimates patches +moving in time (t-patches) as a key building block, alongside a hierarchical +architecture that learns an informative spatio-temporal representation. We show +that our method achieves improved performance on existing datasets, including +DFAUST and IKEA ASM. Code is publicly available at +https://github.com/sitzikbs/3dincaction.",cs.CV,['cs.CV'] +Poly Kernel Inception Network for Remote Sensing Detection,Xinhao Cai · Qiuxia Lai · Yuwei Wang · Wenguan Wang · Zeren Sun · Yazhou Yao, ,https://arxiv.org/abs/2403.06258,,2403.06258.pdf,Poly Kernel Inception Network for Remote Sensing Detection,"Object detection in remote sensing images (RSIs) often suffers from several +increasing challenges, including the large variation in object scales and the +diverse-ranging context. 
Prior methods tried to address these challenges by +expanding the spatial receptive field of the backbone, either through +large-kernel convolution or dilated convolution. However, the former typically +introduces considerable background noise, while the latter risks generating +overly sparse feature representations. In this paper, we introduce the Poly +Kernel Inception Network (PKINet) to handle the above challenges. PKINet +employs multi-scale convolution kernels without dilation to extract object +features of varying scales and capture local context. In addition, a Context +Anchor Attention (CAA) module is introduced in parallel to capture long-range +contextual information. These two components work jointly to advance the +performance of PKINet on four challenging remote sensing detection benchmarks, +namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R.",cs.CV,['cs.CV'] +Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning,Zihua Zhao · Mengxi Chen · Tianjie Dai · Jiangchao Yao · Bo Han · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2405.16996,,2405.16996.pdf,Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning,"Noisy correspondence that refers to mismatches in cross-modal data pairs, is +prevalent on human-annotated or web-crawled datasets. Prior approaches to +leverage such data mainly consider the application of uni-modal noisy label +learning without amending the impact on both cross-modal and intra-modal +geometrical structures in multimodal learning. Actually, we find that both +structures are effective to discriminate noisy correspondence through +structural differences when being well-established. Inspired by this +observation, we introduce a Geometrical Structure Consistency (GSC) method to +infer the true correspondence. Specifically, GSC ensures the preservation of +geometrical structures within and between modalities, allowing for the accurate +discrimination of noisy samples based on structural differences. Utilizing +these inferred true correspondence labels, GSC refines the learning of +geometrical structures by filtering out the noisy samples. Experiments across +four cross-modal datasets confirm that GSC effectively identifies noisy samples +and significantly outperforms the current leading methods.",cs.CV,['cs.CV'] +Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution,Guangyuan Li · Chen Rao · Juncheng Mo · Zhanjie Zhang · Wei Xing · Lei Zhao, ,https://arxiv.org/abs/2404.04785,,2404.04785.pdf,Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution,"Recently, diffusion models (DM) have been applied in magnetic resonance +imaging (MRI) super-resolution (SR) reconstruction, exhibiting impressive +performance, especially with regard to detailed reconstruction. However, the +current DM-based SR reconstruction methods still face the following issues: (1) +They require a large number of iterations to reconstruct the final image, which +is inefficient and consumes a significant amount of computational resources. +(2) The results reconstructed by these methods are often misaligned with the +real high-resolution images, leading to remarkable distortion in the +reconstructed MR images. To address the aforementioned issues, we propose an +efficient diffusion model for multi-contrast MRI SR, named as DiffMSR. +Specifically, we apply DM in a highly compact low-dimensional latent space to +generate prior knowledge with high-frequency detail information. 
The highly +compact latent space ensures that DM requires only a few simple iterations to +produce accurate prior knowledge. In addition, we design the Prior-Guide Large +Window Transformer (PLWformer) as the decoder for DM, which can extend the +receptive field while fully utilizing the prior knowledge generated by DM to +ensure that the reconstructed MR image remains undistorted. Extensive +experiments on public and clinical datasets demonstrate that our DiffMSR +outperforms state-of-the-art methods.",cs.CV,['cs.CV'] +HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models,Li Pang · Xiangyu Rui · Long Cui · Hongzhong Wang · Deyu Meng · Xiangyong Cao, ,https://arxiv.org/abs/2402.15865,,2402.15865.pdf,HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models,"Hyperspectral image (HSI) restoration aims at recovering clean images from +degraded observations and plays a vital role in downstream tasks. Existing +model-based methods have limitations in accurately modeling the complex image +characteristics with handcraft priors, and deep learning-based methods suffer +from poor generalization ability. To alleviate these issues, this paper +proposes an unsupervised HSI restoration framework with pre-trained diffusion +model (HIR-Diff), which restores the clean HSIs from the product of two +low-rank components, i.e., the reduced image and the coefficient matrix. +Specifically, the reduced image, which has a low spectral dimension, lies in +the image field and can be inferred from our improved diffusion model where a +new guidance function with total variation (TV) prior is designed to ensure +that the reduced image can be well sampled. The coefficient matrix can be +effectively pre-estimated based on singular value decomposition (SVD) and +rank-revealing QR (RRQR) factorization. Furthermore, a novel exponential noise +schedule is proposed to accelerate the restoration process (about 5$\times$ +acceleration for denoising) with little performance decrease. Extensive +experimental results validate the superiority of our method in both performance +and speed on a variety of HSI restoration tasks, including HSI denoising, noisy +HSI super-resolution, and noisy HSI inpainting. The code is available at +https://github.com/LiPang/HIRDiff.",cs.CV,"['cs.CV', 'eess.IV']" +Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping,Ruoxi Zhu · Shusong Xu · Peiye Liu · Sicheng Li · Yanheng Lu · Dimin Niu · Zihao Liu · Zihao Meng · Li Zhiyong · Xinhua Chen · Yibo Fan, ,https://arxiv.org/abs/2309.16975,,2309.16975.pdf,Perceptual Tone Mapping Model for High Dynamic Range Imaging,"One of the key challenges in tone mapping is to preserve the perceptual +quality of high dynamic range (HDR) images when mapping them to standard +dynamic range (SDR) displays. Traditional tone mapping operators (TMOs) +compress the luminance of HDR images without considering the surround and +display conditions emanating into suboptimal results. Current research +addresses this challenge by incorporating perceptual color appearance +attributes. In this work, we propose a TMO (TMOz) that leverages CIECAM16 +perceptual attributes, i.e., brightness, colorfulness, and hue. TMOz accounts +for the effects of both the surround and the display conditions to achieve more +optimal colorfulness reproduction. 
The perceptual brightness is compressed, and +the perceptual color scales, i.e., colorfulness and hue are derived from HDR +images by employing CIECAM16 color adaptation equations. A psychophysical +experiment was conducted to automate the brightness compression parameter. The +model employs fully automatic and adaptive approach, obviating the requirement +for manual parameter selection. TMOz was evaluated in terms of contrast, +colorfulness and overall image quality. The objective and subjective evaluation +methods revealed that the proposed model outperformed the state-of-the-art +TMOs.",cs.CV,"['cs.CV', 'eess.IV']" +VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning,Kang Chen · Xiangqian Wu,https://visual-text-qa.github.io/,,https://dl.acm.org/doi/pdf/10.1145/3581783.3612850,,,,,nan +Composing Object Relations and Attributes for Image-Text Matching,Khoi Pham · Chuong Huynh · Ser-Nam Lim · Abhinav Shrivastava, ,,https://hmchuong.github.io/,,,,,nan +HashPoint: Accelerated Point Searching and Sampling for Neural Rendering,Jiahao Ma · Miaomiao Liu · David Ahmedt-Aristizabal · Chuong Nguyen,https://jiahao-ma.github.io/hashpoint/,https://arxiv.org/abs/2404.14044,,2404.14044.pdf,HashPoint: Accelerated Point Searching and Sampling for Neural Rendering,"In this paper, we address the problem of efficient point searching and +sampling for volume neural rendering. Within this realm, two typical approaches +are employed: rasterization and ray tracing. The rasterization-based methods +enable real-time rendering at the cost of increased memory and lower fidelity. +In contrast, the ray-tracing-based methods yield superior quality but demand +longer rendering time. We solve this problem by our HashPoint method combining +these two strategies, leveraging rasterization for efficient point searching +and sampling, and ray marching for rendering. Our method optimizes point +searching by rasterizing points within the camera's view, organizing them in a +hash table, and facilitating rapid searches. Notably, we accelerate the +rendering process by adaptive sampling on the primary surface encountered by +the ray. Our approach yields substantial speed-up for a range of +state-of-the-art ray-tracing-based methods, maintaining equivalent or superior +accuracy across synthetic and real test datasets. The code will be available at +https://jiahao-ma.github.io/hashpoint/.",cs.CV,['cs.CV'] +Improving Depth Completion via Depth Feature Upsampling,Yufei Wang · Ge Zhang · Shaoqian Wang · Bo Li · Qi Liu · Le Hui · Yuchao Dai, ,https://arxiv.org/abs/2310.08956,,2310.08956.pdf,LRRU: Long-short Range Recurrent Updating Networks for Depth Completion,"Existing deep learning-based depth completion methods generally employ +massive stacked layers to predict the dense depth map from sparse input data. +Although such approaches greatly advance this task, their accompanied huge +computational complexity hinders their practical applications. To accomplish +depth completion more efficiently, we propose a novel lightweight deep network +framework, the Long-short Range Recurrent Updating (LRRU) network. Without +learning complex feature representations, LRRU first roughly fills the sparse +input to obtain an initial dense depth map, and then iteratively updates it +through learned spatially-variant kernels. 
Our iterative update process is +content-adaptive and highly flexible, where the kernel weights are learned by +jointly considering the guidance RGB images and the depth map to be updated, +and large-to-small kernel scopes are dynamically adjusted to capture +long-to-short range dependencies. Our initial depth map has coarse but complete +scene depth information, which helps relieve the burden of directly regressing +the dense depth from sparse ones, while our proposed method can effectively +refine it to an accurate depth map with less learnable parameters and inference +time. Experimental results demonstrate that our proposed LRRU variants achieve +state-of-the-art performance across different parameter regimes. In particular, +the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and +ranks 1st on the KITTI depth completion benchmark at the time of submission. +Project page: https://npucvr.github.io/LRRU/.",cs.CV,['cs.CV'] +Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments,Yinhua Piao · Sangseon Lee · Yijingxiu Lu · Sun Kim,https://github.com/qkrdmsghk/GOODHSE,https://arxiv.org/abs/2403.01773,,2403.01773.pdf,Improving out-of-distribution generalization in graphs via hierarchical semantic environments,"Out-of-distribution (OOD) generalization in the graph domain is challenging +due to complex distribution shifts and a lack of environmental contexts. Recent +methods attempt to enhance graph OOD generalization by generating flat +environments. However, such flat environments come with inherent limitations to +capture more complex data distributions. Considering the DrugOOD dataset, which +contains diverse training environments (e.g., scaffold, size, etc.), flat +contexts cannot sufficiently address its high heterogeneity. Thus, a new +challenge is posed to generate more semantically enriched environments to +enhance graph invariant learning for handling distribution shifts. In this +paper, we propose a novel approach to generate hierarchical semantic +environments for each graph. Firstly, given an input graph, we explicitly +extract variant subgraphs from the input graph to generate proxy predictions on +local environments. Then, stochastic attention mechanisms are employed to +re-extract the subgraphs for regenerating global environments in a hierarchical +manner. In addition, we introduce a new learning objective that guides our +model to learn the diversity of environments within the same hierarchy while +maintaining consistency across different hierarchies. This approach enables our +model to consider the relationships between environments and facilitates robust +graph invariant learning. Extensive experiments on real-world graph data have +demonstrated the effectiveness of our framework. Particularly, in the +challenging dataset DrugOOD, our method achieves up to 1.29% and 2.83% +improvement over the best baselines on IC50 and EC50 prediction tasks, +respectively.",cs.LG,"['cs.LG', 'cs.AI']" +Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval,Haochen Han · Qinghua Zheng · Guang Dai · Minnan Luo · Jingdong Wang,https://github.com/hhc1997/L2RM,https://arxiv.org/abs/2403.05105,,2403.05105.pdf,Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval,"Collecting well-matched multimedia datasets is crucial for training +cross-modal retrieval models. 
However, in real-world scenarios, massive +multimodal data are harvested from the Internet, which inevitably contains +Partially Mismatched Pairs (PMPs). Undoubtedly, such semantical irrelevant data +will remarkably harm the cross-modal retrieval performance. Previous efforts +tend to mitigate this problem by estimating a soft correspondence to +down-weight the contribution of PMPs. In this paper, we aim to address this +challenge from a new perspective: the potential semantic similarity among +unpaired samples makes it possible to excavate useful knowledge from mismatched +pairs. To achieve this, we propose L2RM, a general framework based on Optimal +Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM aims to +generate refined alignments by seeking a minimal-cost transport plan across +different modalities. To formalize the rematching idea in OT, first, we propose +a self-supervised cost function that automatically learns from explicit +similarity-cost mapping relation. Second, we present to model a partial OT +problem while restricting the transport among false positives to further boost +refined alignments. Extensive experiments on three benchmarks demonstrate our +L2RM significantly improves the robustness against PMPs for existing models. +The code is available at https://github.com/hhc1997/L2RM.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +Spatio-Temporal Turbulence Mitigation: A Translational Perspective,Xingguang Zhang · Nicholas M Chimitt · Yiheng Chi · Zhiyuan Mao · Stanley H. Chan, ,https://arxiv.org/abs/2401.04244,,2401.04244.pdf,Spatio-Temporal Turbulence Mitigation: A Translational Perspective,"Recovering images distorted by atmospheric turbulence is a challenging +inverse problem due to the stochastic nature of turbulence. Although numerous +turbulence mitigation (TM) algorithms have been proposed, their efficiency and +generalization to real-world dynamic scenarios remain severely limited. +Building upon the intuitions of classical TM algorithms, we present the Deep +Atmospheric TUrbulence Mitigation network (DATUM). DATUM aims to overcome major +challenges when transitioning from classical to deep learning approaches. By +carefully integrating the merits of classical multi-frame TM methods into a +deep network structure, we demonstrate that DATUM can efficiently perform +long-range temporal aggregation using a recurrent fashion, while deformable +attention and temporal-channel attention seamlessly facilitate pixel +registration and lucky imaging. With additional supervision, tilt and blur +degradation can be jointly mitigated. These inductive biases empower DATUM to +significantly outperform existing methods while delivering a tenfold increase +in processing speed. A large-scale training dataset, ATSyn, is presented as a +co-invention to enable generalization in real turbulence. Our code and datasets +are available at https://xg416.github.io/DATUM.",eess.IV,"['eess.IV', 'cs.CV']" +Seamless Human Motion Composition with Blended Positional Encodings,German Barquero · Sergio Escalera · Cristina Palmero,https://barquerogerman.github.io/FlowMDM/,https://arxiv.org/abs/2402.15509,,2402.15509.pdf,Seamless Human Motion Composition with Blended Positional Encodings,"Conditional human motion generation is an important topic with many +applications in virtual reality, gaming, and robotics. While prior works have +focused on generating motion guided by text, music, or scenes, these typically +result in isolated motions confined to short durations. 
Instead, we address the +generation of long, continuous sequences guided by a series of varying textual +descriptions. In this context, we introduce FlowMDM, the first diffusion-based +model that generates seamless Human Motion Compositions (HMC) without any +postprocessing or redundant denoising steps. For this, we introduce the Blended +Positional Encodings, a technique that leverages both absolute and relative +positional encodings in the denoising chain. More specifically, global motion +coherence is recovered at the absolute stage, whereas smooth and realistic +transitions are built at the relative stage. As a result, we achieve +state-of-the-art results in terms of accuracy, realism, and smoothness on the +Babel and HumanML3D datasets. FlowMDM excels when trained with only a single +description per motion sequence thanks to its Pose-Centric Cross-ATtention, +which makes it robust against varying text descriptions at inference time. +Finally, to address the limitations of existing HMC metrics, we propose two new +metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt +transitions.",cs.CV,['cs.CV'] +CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation,Lingjun Zhao · Jingyu Song · Katherine Skinner,https://song-jingyu.github.io/CRKD,https://arxiv.org/abs/2403.19104,,2403.19104.pdf,CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation,"In the field of 3D object detection for autonomous driving, LiDAR-Camera (LC) +fusion is the top-performing sensor configuration. Still, LiDAR is relatively +high cost, which hinders adoption of this technology for consumer automobiles. +Alternatively, camera and radar are commonly deployed on vehicles already on +the road today, but performance of Camera-Radar (CR) fusion falls behind LC +fusion. In this work, we propose Camera-Radar Knowledge Distillation (CRKD) to +bridge the performance gap between LC and CR detectors with a novel +cross-modality KD framework. We use the Bird's-Eye-View (BEV) representation as +the shared feature space to enable effective knowledge distillation. To +accommodate the unique cross-modality KD path, we propose four distillation +losses to help the student learn crucial features from the teacher model. We +present extensive evaluations on the nuScenes dataset to demonstrate the +effectiveness of the proposed CRKD framework. The project page for CRKD is +https://song-jingyu.github.io/CRKD.",cs.CV,"['cs.CV', 'cs.RO']" +RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception,Ruiyang Hao · Siqi Fan · Yingru Dai · Zhenlin Zhang · Chenxi Li · YuntianWang · Haibao Yu · Wenxian Yang · Jirui Yuan · Zaiqing Nie,https://github.com/AIR-THU/DAIR-RCooper,https://arxiv.org/abs/2403.10145,,2403.10145.pdf,RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception,"The value of roadside perception, which could extend the boundaries of +autonomous driving and traffic management, has gradually become more prominent +and acknowledged in recent years. However, existing roadside perception +approaches only focus on the single-infrastructure sensor system, which cannot +realize a comprehensive understanding of a traffic area because of the limited +sensing range and blind spots. Orienting high-quality roadside perception, we +need Roadside Cooperative Perception (RCooper) to achieve practical +area-coverage roadside perception for restricted traffic areas. 
Rcooper has its +own domain-specific challenges, but further exploration is hindered due to the +lack of datasets. We hence release the first real-world, large-scale RCooper +dataset to bloom the research on practical roadside cooperative perception, +including detection and tracking. The manually annotated dataset comprises 50k +images and 30k point clouds, including two representative traffic scenes (i.e., +intersection and corridor). The constructed benchmarks prove the effectiveness +of roadside cooperation perception and demonstrate the direction of further +research. Codes and dataset can be accessed at: +https://github.com/AIR-THU/DAIR-RCooper.",cs.CV,"['cs.CV', 'cs.RO', 'I.4.8; I.5.4']" +Scene Adaptive Sparse Transformer for Event-based Object Detection,Yansong Peng · Li Hebei · Yueyi Zhang · Xiaoyan Sun · Feng Wu, ,https://arxiv.org/abs/2404.01882,,2404.01882.pdf,Scene Adaptive Sparse Transformer for Event-based Object Detection,"While recent Transformer-based approaches have shown impressive performances +on event-based object detection tasks, their high computational costs still +diminish the low power consumption advantage of event cameras. Image-based +works attempt to reduce these costs by introducing sparse Transformers. +However, they display inadequate sparsity and adaptability when applied to +event-based object detection, since these approaches cannot balance the fine +granularity of token-level sparsification and the efficiency of window-based +Transformers, leading to reduced performance and efficiency. Furthermore, they +lack scene-specific sparsity optimization, resulting in information loss and a +lower recall rate. To overcome these limitations, we propose the Scene Adaptive +Sparse Transformer (SAST). SAST enables window-token co-sparsification, +significantly enhancing fault tolerance and reducing computational overhead. +Leveraging the innovative scoring and selection modules, along with the Masked +Sparse Window Self-Attention, SAST showcases remarkable scene-aware +adaptability: It focuses only on important objects and dynamically optimizes +sparsity level according to scene complexity, maintaining a remarkable balance +between performance and computational cost. The evaluation results show that +SAST outperforms all other dense and sparse networks in both performance and +efficiency on two large-scale event-based object detection datasets (1Mpx and +Gen1). Code: https://github.com/Peterande/SAST",cs.CV,['cs.CV'] +Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis,Yiyang Chen · Lunhao Duan · Shanshan Zhao · Changxing Ding · Dacheng Tao,https://github.com/wdttt/LocoTrans,https://arxiv.org/abs/2403.11113,,2403.11113.pdf,Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis,"Rotation invariance is an important requirement for point shape analysis. To +achieve this, current state-of-the-art methods attempt to construct the local +rotation-invariant representation through learning or defining the local +reference frame (LRF). Although efficient, these LRF-based methods suffer from +perturbation of local geometric relations, resulting in suboptimal local +rotation invariance. To alleviate this issue, we propose a Local-consistent +Transformation (LocoTrans) learning strategy. Specifically, we first construct +the local-consistent reference frame (LCRF) by considering the symmetry of the +two axes in LRF. 
In comparison with previous LRFs, our LCRF is able to preserve +local geometric relationships better through performing local-consistent +transformation. However, as the consistency only exists in local regions, the +relative pose information is still lost in the intermediate layers of the +network. We mitigate such a relative pose issue by developing a relative pose +recovery (RPR) module. RPR aims to restore the relative pose between adjacent +transformed patches. Equipped with LCRF and RPR, our LocoTrans is capable of +learning local-consistent transformation and preserving local geometry, which +benefits rotation invariance learning. Competitive performance under arbitrary +rotations on both shape classification and part segmentation tasks and +ablations can demonstrate the effectiveness of our method. Code will be +available publicly at https://github.com/wdttt/LocoTrans.",cs.CV,['cs.CV'] +Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning,Yun Li · Zhe Liu · Hang Chen · Lina Yao, ,https://arxiv.org/abs/2402.17251,,2402.17251.pdf,Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning,"Compositional Zero-Shot Learning (CZSL) aims to recognize unseen +attribute-object pairs based on a limited set of observed examples. Current +CZSL methodologies, despite their advancements, tend to neglect the distinct +specificity levels present in attributes. For instance, given images of sliced +strawberries, they may fail to prioritize `Sliced-Strawberry' over a generic +`Red-Strawberry', despite the former being more informative. They also suffer +from ballooning search space when shifting from Close-World (CW) to Open-World +(OW) CZSL. To address the issues, we introduce the Context-based and +Diversity-driven Specificity learning framework for CZSL (CDS-CZSL). Our +framework evaluates the specificity of attributes by considering the diversity +of objects they apply to and their related context. This novel approach allows +for more accurate predictions by emphasizing specific attribute-object pairs +and improves composition filtering in OW-CZSL. We conduct experiments in both +CW and OW scenarios, and our model achieves state-of-the-art results across +three datasets.",cs.CV,['cs.CV'] +MemFlow: Optical Flow Estimation and Prediction with Memory,Qiaole Dong · Yanwei Fu,https://dqiaole.github.io/MemFlow/,https://arxiv.org/abs/2404.04808,,2404.04808.pdf,MemFlow: Optical Flow Estimation and Prediction with Memory,"Optical flow is a classical task that is important to the vision community. +Classical optical flow estimation uses two frames as input, whilst some recent +methods consider multiple frames to explicitly model long-range information. +The former ones limit their ability to fully leverage temporal coherence along +the video sequence; and the latter ones incur heavy computational overhead, +typically not possible for real-time flow estimation. Some multi-frame-based +approaches even necessitate unseen future frames for current estimation, +compromising real-time applicability in safety-critical scenarios. To this end, +we present MemFlow, a real-time method for optical flow estimation and +prediction with memory. Our method enables memory read-out and update modules +for aggregating historical motion information in real-time. Furthermore, we +integrate resolution-adaptive re-scaling to accommodate diverse video +resolutions. Besides, our approach seamlessly extends to the future prediction +of optical flow based on past observations. 
Leveraging effective historical +motion aggregation, our method outperforms VideoFlow with fewer parameters and +faster inference speed on Sintel and KITTI-15 datasets in terms of +generalization performance. At the time of submission, MemFlow also leads in +performance on the 1080p Spring dataset. Codes and models will be available at: +https://dqiaole.github.io/MemFlow/.",cs.CV,['cs.CV'] +H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration,Morteza Ghahremani · Mohammad Khateri · Bailiang Jian · Benedikt Wiestler · Ehsan Adeli · Christian Wachinger, ,https://arxiv.org/abs/2306.05688,,2306.05688.pdf,ModeT: Learning Deformable Image Registration via Motion Decomposition Transformer,"The Transformer structures have been widely used in computer vision and have +recently made an impact in the area of medical image registration. However, the +use of Transformer in most registration networks is straightforward. These +networks often merely use the attention mechanism to boost the feature learning +as the segmentation networks do, but do not sufficiently design to be adapted +for the registration task. In this paper, we propose a novel motion +decomposition Transformer (ModeT) to explicitly model multiple motion +modalities by fully exploiting the intrinsic capability of the Transformer +structure for deformation estimation. The proposed ModeT naturally transforms +the multi-head neighborhood attention relationship into the multi-coordinate +relationship to model multiple motion modes. Then the competitive weighting +module (CWM) fuses multiple deformation sub-fields to generate the resulting +deformation field. Extensive experiments on two public brain magnetic resonance +imaging (MRI) datasets show that our method outperforms current +state-of-the-art registration networks and Transformers, demonstrating the +potential of our ModeT for the challenging non-rigid deformation estimation +problem. The benchmarks and our code are publicly available at +https://github.com/ZAX130/SmileCode.",cs.CV,['cs.CV'] +Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields,Zhiyuan Min · Yawei Luo · Wei Yang · Yuesong Wang · Yi Yang,https://github.com/tatakai1/EVENeRF,https://arxiv.org/abs/2311.11845,,2311.11845.pdf,Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields,"Generalizable NeRF can directly synthesize novel views across new scenes, +eliminating the need for scene-specific retraining in vanilla NeRF. A critical +enabling factor in these approaches is the extraction of a generalizable 3D +representation by aggregating source-view features. In this paper, we propose +an Entangled View-Epipolar Information Aggregation method dubbed EVE-NeRF. +Different from existing methods that consider cross-view and along-epipolar +information independently, EVE-NeRF conducts the view-epipolar feature +aggregation in an entangled manner by injecting the scene-invariant appearance +continuity and geometry consistency priors to the aggregation process. Our +approach effectively mitigates the potential lack of inherent geometric and +appearance constraint resulting from one-dimensional interactions, thus further +boosting the 3D representation generalizablity. EVE-NeRF attains +state-of-the-art performance across various evaluation scenarios. Extensive +experiments demonstate that, compared to prevailing single-dimensional +aggregation, the entangled network excels in the accuracy of 3D scene geometry +and appearance reconstruction. 
Our code is publicly available at +https://github.com/tatakai1/EVENeRF.",cs.CV,['cs.CV'] +Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective,Jinjing Zhao · Fangyun Wei · Chang Xu,https://github.com/ZhaoJingjing713/HPR,,,,,,,nan +Hyperbolic Anomaly Detection,Huimin Li · Zhentao Chen · Yunhao Xu · Junlin Hu, ,https://arxiv.org/abs/2403.20236,,2403.20236.pdf,Long-Tailed Anomaly Detection with Learnable Class Names,"Anomaly detection (AD) aims to identify defective images and localize their +defects (if any). Ideally, AD models should be able to detect defects over many +image classes; without relying on hard-coded class names that can be +uninformative or inconsistent across datasets; learn without anomaly +supervision; and be robust to the long-tailed distributions of real-world +applications. To address these challenges, we formulate the problem of +long-tailed AD by introducing several datasets with different levels of class +imbalance and metrics for performance evaluation. We then propose a novel +method, LTAD, to detect defects from multiple and long-tailed classes, without +relying on dataset class names. LTAD combines AD by reconstruction and semantic +AD modules. AD by reconstruction is implemented with a transformer-based +reconstruction module. Semantic AD is implemented with a binary classifier, +which relies on learned pseudo class names and a pretrained foundation model. +These modules are learned over two phases. Phase 1 learns the pseudo-class +names and a variational autoencoder (VAE) for feature synthesis that augments +the training data to combat long-tails. Phase 2 then learns the parameters of +the reconstruction and classification modules of LTAD. Extensive experiments +using the proposed long-tailed datasets show that LTAD substantially +outperforms the state-of-the-art methods for most forms of dataset imbalance. +The long-tailed dataset split is available at +https://zenodo.org/records/10854201 .",cs.CV,['cs.CV'] +HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models,Tianrui Guan · Fuxiao Liu · Xiyang Wu · Ruiqi Xian · Zongxia Li · Xiaoyu Liu · Xijun Wang · Lichang Chen · Furong Huang · Yaser Yacoob · Dinesh Manocha · Tianyi Zhou,https://github.com/tianyi-lab/HallusionBench,https://arxiv.org/abs/2310.14566,,2310.14566.pdf,HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models,"We introduce HallusionBench, a comprehensive benchmark designed for the +evaluation of image-context reasoning. This benchmark presents significant +challenges to advanced large visual-language models (LVLMs), such as +GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing +nuanced understanding and interpretation of visual data. The benchmark +comprises 346 images paired with 1129 questions, all meticulously crafted by +human experts. We introduce a novel structure for these visual questions +designed to establish control groups. This structure enables us to conduct a +quantitative analysis of the models' response tendencies, logical consistency, +and various failure modes. In our evaluation on HallusionBench, we benchmarked +15 different models, highlighting a 31.42% question-pair accuracy achieved by +the state-of-the-art GPT-4V. Notably, all other evaluated models achieve +accuracy below 16%. 
Moreover, our analysis not only highlights the observed +failure modes, including language hallucination and visual illusion, but also +deepens an understanding of these pitfalls. Our comprehensive case studies +within HallusionBench shed light on the challenges of hallucination and +illusion in LVLMs. Based on these insights, we suggest potential pathways for +their future improvement. The benchmark and codebase can be accessed at +https://github.com/tianyi-lab/HallusionBench.",cs.CV,"['cs.CV', 'cs.CL']" +Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization,Ioanna Ntinou · Enrique Sanchez · Georgios Tzimiropoulos,https://github.com/IoannaNti/BMViT,https://arxiv.org/abs/2312.17686,,2312.17686.pdf,Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization,"Action Localization is a challenging problem that combines detection and +recognition tasks, which are often addressed separately. State-of-the-art +methods rely on off-the-shelf bounding box detections pre-computed at high +resolution, and propose transformer models that focus on the classification +task alone. Such two-stage solutions are prohibitive for real-time deployment. +On the other hand, single-stage methods target both tasks by devoting part of +the network (generally the backbone) to sharing the majority of the workload, +compromising performance for speed. These methods build on adding a DETR head +with learnable queries that after cross- and self-attention can be sent to +corresponding MLPs for detecting a person's bounding box and action. However, +DETR-like architectures are challenging to train and can incur in big +complexity. + In this paper, we observe that \textbf{a straight bipartite matching loss can +be applied to the output tokens of a vision transformer}. This results in a +backbone + MLP architecture that can do both tasks without the need of an extra +encoder-decoder head and learnable queries. We show that a single MViTv2-S +architecture trained with bipartite matching to perform both tasks surpasses +the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. +With a careful design of token pooling and the proposed training pipeline, our +Bipartite-Matching Vision Transformer model, \textbf{BMViT}, achieves +3 mAP on +AVA2.2. w.r.t. the two-stage MViTv2-S counterpart. Code is available at +\href{https://github.com/IoannaNti/BMViT}{https://github.com/IoannaNti/BMViT}",cs.CV,['cs.CV'] +UnO: Unsupervised Occupancy Fields for Perception and Forecasting,Ben Agro · Quinlan Sykora · Sergio Casas · Thomas Gilles · Raquel Urtasun, ,https://arxiv.org/abs/2308.01471,,2308.01471.pdf,Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving,"A self-driving vehicle (SDV) must be able to perceive its surroundings and +predict the future behavior of other traffic participants. Existing works +either perform object detection followed by trajectory forecasting of the +detected objects, or predict dense occupancy and flow grids for the whole +scene. The former poses a safety concern as the number of detections needs to +be kept low for efficiency reasons, sacrificing object recall. The latter is +computationally expensive due to the high-dimensionality of the output grid, +and suffers from the limited receptive field inherent to fully convolutional +networks. Furthermore, both approaches employ many computational resources +predicting areas or objects that might never be queried by the motion planner. 
+This motivates our unified approach to perception and future prediction that +implicitly represents occupancy and flow over time with a single neural +network. Our method avoids unnecessary computation, as it can be directly +queried by the motion planner at continuous spatio-temporal locations. +Moreover, we design an architecture that overcomes the limited receptive field +of previous explicit occupancy prediction methods by adding an efficient yet +effective global attention mechanism. Through extensive experiments in both +urban and highway settings, we demonstrate that our implicit model outperforms +the current state-of-the-art. For more information, visit the project website: +https://waabi.ai/research/implicito.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +Do Vision and Language Encoders Represent the World Similarly?,Mayug Maniparambil · Raiymbek Akshulakov · YASSER ABDELAZIZ DAHOU DJILALI · Mohamed El Amine Seddik · Sanath Narayan · Karttikeya Mangalam · Noel O'Connor,https://github.com/mayug/0-shot-llm-vision,https://arxiv.org/abs/2401.05224,,2401.05224.pdf,Do Vision and Language Encoders Represent the World Similarly?,"Aligned text-image encoders such as CLIP have become the de facto model for +vision-language tasks. Furthermore, modality-specific encoders achieve +impressive performances in their respective domains. This raises a central +question: does an alignment exist between uni-modal vision and language +encoders since they fundamentally represent the same physical world? Analyzing +the latent spaces structure of vision and language models on image-caption +benchmarks using the Centered Kernel Alignment (CKA), we find that the +representation spaces of unaligned and aligned encoders are semantically +similar. In the absence of statistical similarity in aligned encoders like +CLIP, we show that a possible matching of unaligned encoders exists without any +training. We frame this as a seeded graph-matching problem exploiting the +semantic similarity between graphs and propose two methods - a Fast Quadratic +Assignment Problem optimization, and a novel localized CKA metric-based +matching/retrieval. We demonstrate the effectiveness of this on several +downstream tasks including cross-lingual, cross-domain caption matching and +image classification. Code available at github.com/mayug/0-shot-llm-vision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +You Only Need Less Attention Each Stage in Vision Transformers,Shuoxi Zhang · Hanpeng Liu · Stephen Lin · Kun He, ,,,,,,,nan +DemoCaricature: Democratising Caricature Generation with a Rough Sketch,Dar-Yen Chen · Ayan Kumar Bhunia · Subhadeep Koley · Aneeshan Sain · Pinaki Nath Chowdhury · Yi-Zhe Song,https://democaricature.github.io/,https://arxiv.org/abs/2312.04364v1,,2312.04364v1.pdf,DemoCaricature: Democratising Caricature Generation with a Rough Sketch,"In this paper, we democratise caricature generation, empowering individuals +to effortlessly craft personalised caricatures with just a photo and a +conceptual sketch. Our objective is to strike a delicate balance between +abstraction and identity, while preserving the creativity and subjectivity +inherent in a sketch. To achieve this, we present Explicit Rank-1 Model Editing +alongside single-image personalisation, selectively applying nuanced edits to +cross-attention layers for a seamless merge of identity and style. +Additionally, we propose Random Mask Reconstruction to enhance robustness, +directing the model to focus on distinctive identity and style features. 
+Crucially, our aim is not to replace artists but to eliminate accessibility +barriers, allowing enthusiasts to engage in the artistry.",cs.CV,['cs.CV'] +Prompt Highlighter: Interactive Control for Multi-Modal LLMs,Yuechen Zhang · Shengju Qian · Bohao Peng · Shu Liu · Jiaya Jia,https://github.com/dvlab-research/Prompt-Highlighter,https://arxiv.org/abs/2312.04302,,2312.04302.pdf,Prompt Highlighter: Interactive Control for Multi-Modal LLMs,"This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) +inference: explicit controllable text generation. Multi-modal LLMs empower +multi-modality understanding with the capability of semantic generation yet +bring less explainability and heavier reliance on prompt contents due to their +autoregressive generative nature. While manipulating prompt formats could +improve outputs, designing specific and precise prompts per task can be +challenging and ineffective. To tackle this issue, we introduce a novel +inference method, Prompt Highlighter, which enables users to highlight specific +prompt spans to interactively control the focus during generation. Motivated by +the classifier-free diffusion guidance, we form regular and unconditional +context pairs based on highlighted tokens, demonstrating that the +autoregressive generation in models can be guided in a classifier-free way. +Notably, we find that, during inference, guiding the models with highlighted +tokens through the attention weights leads to more desired outputs. Our +approach is compatible with current LLMs and VLMs, achieving impressive +customized generation results without training. Experiments confirm its +effectiveness in focusing on input contexts and generating reliable content. +Without tuning on LLaVA-v1.5, our method secured 70.7 in the MMBench test and +1552.5 in MME-perception. The code is available at: +https://github.com/dvlab-research/Prompt-Highlighter/",cs.CV,"['cs.CV', 'cs.CL']" +Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting,Haiwei Chen · Yajie Zhao, ,https://arxiv.org/abs/2403.18186,,2403.18186.pdf,Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting,"We present a method for large-mask pluralistic image inpainting based on the +generative framework of discrete latent codes. Our method learns latent priors, +discretized as tokens, by only performing computations at the visible locations +of the image. This is realized by a restrictive partial encoder that predicts +the token label for each visible block, a bidirectional transformer that infers +the missing labels by only looking at these tokens, and a dedicated synthesis +network that couples the tokens with the partial image priors to generate +coherent and pluralistic complete image even under extreme mask settings. +Experiments on public benchmarks validate our design choices as the proposed +method outperforms strong baselines in both visual quality and diversity +metrics.",cs.CV,['cs.CV'] +Bayesian Differentiable Physics for Cloth Digitalization,Deshan Gong · Ningtao Mao · He Wang, ,https://arxiv.org/abs/2402.17664,,2402.17664.pdf,Bayesian Differentiable Physics for Cloth Digitalization,"We propose a new method for cloth digitalization. Deviating from existing +methods which learn from data captured under relatively casual settings, we +propose to learn from data captured in strictly tested measuring protocols, and +find plausible physical parameters of the cloths. 
However, such data is +currently absent, so we first propose a new dataset with accurate cloth +measurements. Further, the data size is considerably smaller than the ones in +current deep learning, due to the nature of the data capture process. To learn +from small data, we propose a new Bayesian differentiable cloth model to +estimate the complex material heterogeneity of real cloths. It can provide +highly accurate digitalization from very limited data samples. Through +exhaustive evaluation and comparison, we show our method is accurate in cloth +digitalization, efficient in learning from limited data samples, and general in +capturing material variations. Code and data are available +https://github.com/realcrane/Bayesian-Differentiable-Physics-for-Cloth-Digitalization",cs.CV,"['cs.CV', 'F.4.8; I.6.8']" +Few-Shot Object Detection with Foundation Models,Guangxing Han · Ser-Nam Lim, ,https://arxiv.org/abs/2312.14494,,2312.14494.pdf,Revisiting Few-Shot Object Detection with Vision-Language Models,"Few-shot object detection (FSOD) benchmarks have advanced techniques for +detecting new categories with limited annotations. Existing benchmarks +repurpose well-established datasets like COCO by partitioning categories into +base and novel classes for pre-training and fine-tuning respectively. However, +these benchmarks do not reflect how FSOD is deployed in practice. Rather than +only pre-training on a small number of base categories, we argue that it is +more practical to fine-tune a foundation model (e.g., a vision-language model +(VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find +that zero-shot inference from VLMs like GroundingDINO significantly outperforms +the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models +can still be misaligned to target concepts of interest. For example, trailers +on the web may be different from trailers in the context of autonomous +vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol +that evaluates detectors pre-trained on any external datasets and fine-tuned on +K-shots per target class. Further, we note that current FSOD benchmarks are +actually federated datasets containing exhaustive annotations for each category +on a subset of the data. We leverage this insight to propose simple strategies +for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of +our approach on LVIS and nuImages, improving over prior work by 5.9 AP. Our +code is available at https://github.com/anishmadan23/foundational_fsod",cs.CV,['cs.CV'] +MonoHair: High-Fidelity Hair Modeling from a Monocular Video,Keyu Wu · LINGCHEN YANG · Zhiyi Kuang · Yao Feng · Xutao Han · Yuefan Shen · Hongbo Fu · Kun Zhou · Youyi Zheng,https://keyuwu-cs.github.io/MonoHair/,https://arxiv.org/abs/2403.18356,,2403.18356.pdf,MonoHair: High-Fidelity Hair Modeling from a Monocular Video,"Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic +expression, and immersion in computer graphics. While existing 3D hair modeling +methods have achieved impressive performance, the challenge of achieving +high-quality hair reconstruction persists: they either require strict capture +conditions, making practical applications difficult, or heavily rely on learned +prior data, obscuring fine-grained details in images. To address these +challenges, we propose MonoHair,a generic framework to achieve high-fidelity +hair reconstruction from a monocular video, without specific requirements for +environments. 
Our approach bifurcates the hair modeling process into two main +stages: precise exterior reconstruction and interior structure inference. The +exterior is meticulously crafted using our Patch-based Multi-View Optimization +(PMVO). This method strategically collects and integrates hair information from +multiple views, independent of prior data, to produce a high-fidelity exterior +3D line map. This map not only captures intricate details but also facilitates +the inference of the hair's inner structure. For the interior, we employ a +data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D +structural renderings derived from the reconstructed exterior, mirroring the +synthetic 2D inputs used during training. This alignment effectively bridges +the domain gap between our training data and real-world data, thereby enhancing +the accuracy and reliability of our interior structure inference. Lastly, we +generate a strand model and resolve the directional ambiguity by our hair +growth algorithm. Our experiments demonstrate that our method exhibits +robustness across diverse hairstyles and achieves state-of-the-art performance. +For more results, please refer to our project page +https://keyuwu-cs.github.io/MonoHair/.",cs.CV,['cs.CV'] +Solving Masked Jigsaw Puzzles with Diffusion Transformers,Jinyang Liu · Wondmgezahu Teshome · Sandesh Ghimire · Mario Sznaier · Octavia Camps, ,https://arxiv.org/abs/2404.07292,,2404.07292.pdf,Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers,"Solving image and video jigsaw puzzles poses the challenging task of +rearranging image fragments or video frames from unordered sequences to restore +meaningful images and video sequences. Existing approaches often hinge on +discriminative models tasked with predicting either the absolute positions of +puzzle elements or the permutation actions applied to the original data. +Unfortunately, these methods face limitations in effectively solving puzzles +with a large number of elements. In this paper, we propose JPDVT, an innovative +approach that harnesses diffusion transformers to address this challenge. +Specifically, we generate positional information for image patches or video +frames, conditioned on their underlying visual content. This information is +then employed to accurately assemble the puzzle pieces in their correct +positions, even in scenarios involving missing pieces. Our method achieves +state-of-the-art performance on several datasets.",cs.CV,['cs.CV'] +Shadow-Enlightened Image Outpainting,Hang Yu · Ruilin Li · Shaorong Xie · Jiayan Qiu, ,https://arxiv.org/html/2204.08563v2,,2204.08563v2.pdf,Cylin-Painting: Seamless {360\textdegree} Panoramic Image Outpainting and Beyond,"Image outpainting gains increasing attention since it can generate the +complete scene from a partial view, providing a valuable solution to construct +{360\textdegree} panoramic images. As image outpainting suffers from the +intrinsic issue of unidirectional completion flow, previous methods convert the +original problem into inpainting, which allows a bidirectional flow. However, +we find that inpainting has its own limitations and is inferior to outpainting +in certain situations. The question of how they may be combined for the best of +both has as yet remained under-explored. In this paper, we provide a deep +analysis of the differences between inpainting and outpainting, which +essentially depends on how the source pixels contribute to the unknown regions +under different spatial arrangements. 
Motivated by this analysis, we present a +Cylin-Painting framework that involves meaningful collaborations between +inpainting and outpainting and efficiently fuses the different arrangements, +with a view to leveraging their complementary benefits on a seamless cylinder. +Nevertheless, straightforwardly applying the cylinder-style convolution often +generates visually unpleasing results as it discards important positional +information. To address this issue, we further present a learnable positional +embedding strategy to incorporate the missing component of positional encoding +into the cylinder convolution, which significantly improves the panoramic +results. It is noted that while developed for image outpainting, the proposed +algorithm can be effectively extended to other panoramic vision tasks, such as +object detection, depth estimation, and image super-resolution. Code will be +made available at \url{https://github.com/KangLiao929/Cylin-Painting}.",cs.CV,['cs.CV'] +Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking,Wei Cao · Chang Luo · Biao Zhang · Matthias Nießner · Jiapeng Tang, ,https://arxiv.org/abs/2401.06614,,2401.06614.pdf,Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking,"We introduce Motion2VecSets, a 4D diffusion model for dynamic surface +reconstruction from point cloud sequences. While existing state-of-the-art +methods have demonstrated success in reconstructing non-rigid objects using +neural field representations, conventional feed-forward networks encounter +challenges with ambiguous observations from noisy, partial, or sparse point +clouds. To address these challenges, we introduce a diffusion model that +explicitly learns the shape and motion distribution of non-rigid objects +through an iterative denoising process of compressed latent representations. +The diffusion-based priors enable more plausible and probabilistic +reconstructions when handling ambiguous inputs. We parameterize 4D dynamics +with latent sets instead of using global latent codes. This novel 4D +representation allows us to learn local shape and deformation patterns, leading +to more accurate non-linear motion capture and significantly improving +generalizability to unseen motions and identities. For more temporally-coherent +object tracking, we synchronously denoise deformation latent sets and exchange +information across multiple frames. To avoid computational overhead, we +designed an interleaved space and time attention block to alternately aggregate +deformation latents along spatial and temporal domains. Extensive comparisons +against state-of-the-art methods demonstrate the superiority of our +Motion2VecSets in 4D reconstruction from various imperfect observations. More +detailed information can be found at +https://vveicao.github.io/projects/Motion2VecSets/.",cs.CV,['cs.CV'] +Test-Time Linear Out-of-Distribution Detection,Ke Fan · Tong Liu · Xingyu Qiu · Yikai Wang · Lian Huai · Zeyu Shangguan · Shuang Gou · FENGJIAN LIU · Yuqian Fu · Yanwei Fu · Xingqun Jiang, ,https://arxiv.org/abs/2311.16420,,2311.16420.pdf,Model-free Test Time Adaptation for Out-Of-Distribution Detection,"Out-of-distribution (OOD) detection is essential for the reliability of ML +models. Most existing methods for OOD detection learn a fixed decision +criterion from a given in-distribution dataset and apply it universally to +decide if a data point is OOD. 
Recent work~\cite{fang2022is} shows that given +only in-distribution data, it is impossible to reliably detect OOD data without +extra assumptions. Motivated by the theoretical result and recent exploration +of test-time adaptation methods, we propose a Non-Parametric Test Time +\textbf{Ada}ptation framework for \textbf{O}ut-Of-\textbf{D}istribution +\textbf{D}etection (\abbr). Unlike conventional methods, \abbr utilizes online +test samples for model adaptation during testing, enhancing adaptability to +changing data distributions. The framework incorporates detected OOD instances +into decision-making, reducing false positive rates, particularly when ID and +OOD distributions overlap significantly. We demonstrate the effectiveness of +\abbr through comprehensive experiments on multiple OOD detection benchmarks, +extensive empirical studies show that \abbr significantly improves the +performance of OOD detection over state-of-the-art methods. Specifically, \abbr +reduces the false positive rate (FPR95) by $23.23\%$ on the CIFAR-10 benchmarks +and $38\%$ on the ImageNet-1k benchmarks compared to the advanced methods. +Lastly, we theoretically verify the effectiveness of \abbr.",cs.LG,"['cs.LG', 'cs.CV']" +Spatial-Aware Regression for Keypoint Localization,Dongkai Wang · Shiliang Zhang, ,,https://dl.acm.org/doi/10.1145/3581783.3611989,,,,,nan +Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving,JunDa Cheng · Wei Yin · Kaixuan Wang · Xiaozhi Chen · Shijie Wang · Xin Yang, ,https://arxiv.org/abs/2403.07535,,2403.07535.pdf,Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving,"Multi-view depth estimation has achieved impressive performance over various +benchmarks. However, almost all current multi-view systems rely on given ideal +camera poses, which are unavailable in many real-world scenarios, such as +autonomous driving. In this work, we propose a new robustness benchmark to +evaluate the depth estimation system under various noisy pose settings. +Surprisingly, we find current multi-view depth estimation methods or +single-view and multi-view fusion methods will fail when given noisy pose +settings. To address this challenge, we propose a single-view and multi-view +fused depth estimation system, which adaptively integrates high-confident +multi-view and single-view results for both robust and accurate depth +estimations. The adaptive fusion module performs fusion by dynamically +selecting high-confidence regions between two branches based on a wrapping +confidence map. Thus, the system tends to choose the more reliable branch when +facing textureless scenes, inaccurate calibration, dynamic objects, and other +degradation or challenging conditions. Our method outperforms state-of-the-art +multi-view and fusion methods under robustness testing. Furthermore, we achieve +state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when +given accurate pose estimations. Project website: +https://github.com/Junda24/AFNet/.",cs.CV,['cs.CV'] +ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection,Yichen Bai · Zongbo Han · Bing Cao · Xiaoheng Jiang · Qinghua Hu · Changqing Zhang, ,https://arxiv.org/abs/2311.15243,,2311.15243.pdf,ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection,"Out-of-distribution (OOD) detection methods often exploit auxiliary outliers +to train model identifying OOD samples, especially discovering challenging +outliers from auxiliary outliers dataset to improve OOD detection. 
However, +they may still face limitations in effectively distinguishing between the most +challenging OOD samples that are much like in-distribution (ID) data, i.e., +\idlike samples. To this end, we propose a novel OOD detection framework that +discovers \idlike outliers using CLIP \cite{DBLP:conf/icml/RadfordKHRGASAM21} +from the vicinity space of the ID samples, thus helping to identify these most +challenging OOD samples. Then a prompt learning framework is proposed that +utilizes the identified \idlike outliers to further leverage the capabilities +of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a +small number of ID samples to learn the prompts of the model without exposing +other auxiliary outlier datasets. By focusing on the most challenging \idlike +OOD samples and elegantly exploiting the capabilities of CLIP, our method +achieves superior few-shot learning performance on various real-world image +datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method +reduces the average FPR95 by 12.16\% and improves the average AUROC by 2.76\%, +compared to state-of-the-art methods). Code is available at +https://github.com/ycfate/ID-like.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis,Feng Liang · Bichen Wu · Jialiang Wang · Licheng Yu · Kunpeng Li · Yinan Zhao · Ishan Misra · Jia-Bin Huang · Peizhao Zhang · Peter Vajda · Diana Marculescu, ,https://arxiv.org/abs/2312.17681,,2312.17681.pdf,FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis,"Diffusion models have transformed the image-to-image (I2I) synthesis and are +now permeating into videos. However, the advancement of video-to-video (V2V) +synthesis has been hampered by the challenge of maintaining temporal +consistency across video frames. This paper proposes a consistent V2V synthesis +framework by jointly leveraging spatial conditions and temporal optical flow +clues within the source video. Contrary to prior methods that strictly adhere +to optical flow, our approach harnesses its benefits while handling the +imperfection in flow estimation. We encode the optical flow via warping from +the first frame and serve it as a supplementary reference in the diffusion +model. This enables our model for video synthesis by editing the first frame +with any prevalent I2I models and then propagating edits to successive frames. +Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: +FlowVid works seamlessly with existing I2I models, facilitating various +modifications, including stylization, object swaps, and local edits. (2) +Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution +takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, +Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our +FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender +(10.2%), and TokenFlow (40.4%).",cs.CV,"['cs.CV', 'cs.MM']" +Prompt-enhanced Multiple Instance Learning for Weakly Supervised Anomaly Detection,Junxi Chen · Liang Li · Li Su · Zheng-Jun Zha · Qingming Huang, ,https://arxiv.org/abs/2306.14451,,2306.14451.pdf,Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection,"Video anomaly detection under weak supervision presents significant +challenges, particularly due to the lack of frame-level annotations during +training. 
While prior research has utilized graph convolution networks and +self-attention mechanisms alongside multiple instance learning (MIL)-based +classification loss to model temporal relations and learn discriminative +features, these methods often employ multi-branch architectures to capture +local and global dependencies separately, resulting in increased parameters and +computational costs. Moreover, the coarse-grained interclass separability +provided by the binary constraint of MIL-based loss neglects the fine-grained +discriminability within anomalous classes. In response, this paper introduces a +weakly supervised anomaly detection framework that focuses on efficient context +modeling and enhanced semantic discriminability. We present a Temporal Context +Aggregation (TCA) module that captures comprehensive contextual information by +reusing the similarity matrix and implementing adaptive fusion. Additionally, +we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic +priors using knowledge-based prompts to boost the discriminative capacity of +context features while ensuring separability between anomaly sub-classes. +Extensive experiments validate the effectiveness of our method's components, +demonstrating competitive performance with reduced parameters and computational +effort on three challenging benchmarks: UCF-Crime, XD-Violence, and +ShanghaiTech datasets. Notably, our approach significantly improves the +detection accuracy of certain anomaly sub-classes, underscoring its practical +value and efficacy. Our code is available at: +https://github.com/yujiangpu20/PEL4VAD.",cs.CV,['cs.CV'] +Efficient Dataset Distillation via Minimax Diffusion,Jianyang Gu · Saeed Vahidian · Vyacheslav Kungurtsev · Haonan Wang · Wei Jiang · Yang You · Yiran Chen,https://github.com/vimar-gu/MinimaxDiffusion,https://arxiv.org/abs/2311.15529v1,,2311.15529v1.pdf,Efficient Dataset Distillation via Minimax Diffusion,"Dataset distillation reduces the storage and computational consumption of +training a network by generating a small surrogate dataset that encapsulates +rich information of the original large-scale one. However, previous +distillation methods heavily rely on the sample-wise iterative optimization +scheme. As the images-per-class (IPC) setting or image resolution grows larger, +the necessary computation will demand overwhelming time and resources. In this +work, we intend to incorporate generative diffusion techniques for computing +the surrogate dataset. Observing that key factors for constructing an effective +surrogate dataset are representativeness and diversity, we design additional +minimax criteria in the generative training to enhance these facets for the +generated images of diffusion models. We present a theoretical model of the +process as hierarchical diffusion control demonstrating the flexibility of the +diffusion process to target these criteria without jeopardizing the +faithfulness of the sample to the desired distribution. The proposed method +achieves state-of-the-art validation performance while demanding much less +computational resources. Under the 100-IPC setting on ImageWoof, our method +requires less than one-twentieth the distillation time of previous methods, yet +yields even better performance. 
Source code available in +https://github.com/vimar-gu/MinimaxDiffusion.",cs.CV,['cs.CV'] +State Space Models for Event Cameras,Nikola Zubic · Mathias Gehrig · Davide Scaramuzza,https://github.com/uzh-rpg/ssms_event_cameras,https://arxiv.org/abs/2402.15584,,2402.15584.pdf,State Space Models for Event Cameras,"Today, state-of-the-art deep neural networks that process event-camera data +first convert a temporal window of events into dense, grid-like input +representations. As such, they exhibit poor generalizability when deployed at +higher inference frequencies (i.e., smaller temporal windows) than the ones +they were trained on. We address this challenge by introducing state-space +models (SSMs) with learnable timescale parameters to event-based vision. This +design adapts to varying frequencies without the need to retrain the network at +different frequencies. Additionally, we investigate two strategies to +counteract aliasing effects when deploying the model at higher frequencies. We +comprehensively evaluate our approach against existing methods based on RNN and +Transformer architectures across various benchmarks, including Gen1 and 1 Mpx +event camera datasets. Our results demonstrate that SSM-based models train 33% +faster and also exhibit minimal performance degradation when tested at higher +frequencies than the training input. Traditional RNN and Transformer models +exhibit performance drops of more than 20 mAP, with SSMs having a drop of 3.76 +mAP, highlighting the effectiveness of SSMs in event-based vision tasks.",cs.CV,"['cs.CV', 'cs.LG']" +ReCoRe: Regularized Contrastive Representation Learning of World Model,"Rudra P,K. Poudel · Harit Pandya · Stephan Liwicki · Roberto Cipolla",https://www.toshiba.eu/pages/eu/Cambridge-Research-Laboratory/world_models,https://arxiv.org/abs/2312.09056v1,,2312.09056v1.pdf,ReCoRe: Regularized Contrastive Representation Learning of World Model,"While recent model-free Reinforcement Learning (RL) methods have demonstrated +human-level effectiveness in gaming environments, their success in everyday +tasks like visual navigation has been limited, particularly under significant +appearance variations. This limitation arises from (i) poor sample efficiency +and (ii) over-fitting to training scenarios. To address these challenges, we +present a world model that learns invariant features using (i) contrastive +unsupervised learning and (ii) an intervention-invariant regularizer. Learning +an explicit representation of the world dynamics i.e. a world model, improves +sample efficiency while contrastive learning implicitly enforces learning of +invariant features, which improves generalization. However, the naive +integration of contrastive loss to world models fails due to a lack of +supervisory signals to the visual encoder, as world-model-based RL methods +independently optimize representation learning and agent policy. To overcome +this issue, we propose an intervention-invariant regularizer in the form of an +auxiliary task such as depth prediction, image denoising, etc., that explicitly +enforces invariance to style-interventions. Our method outperforms current +state-of-the-art model-based and model-free RL methods and significantly on +out-of-distribution point navigation task evaluated on the iGibson benchmark. +We further demonstrate that our approach, with only visual observations, +outperforms recent language-guided foundation models for point navigation, +which is essential for deployment on robots with limited computation +capabilities. 
Finally, we demonstrate that our proposed model excels at the +sim-to-real transfer of its perception module on Gibson benchmark.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV', 'cs.RO', 'stat.ML']" +Semantic Human Mesh Reconstruction with Textures,xiaoyu zhan · Jianxin Yang · Yuanqi Li · Jie Guo · Yanwen Guo · Wenping Wang,https://zhanxy.xyz/projects/shert,https://arxiv.org/abs/2403.02561,,2403.02561.pdf,Semantic Human Mesh Reconstruction with Textures,"The field of 3D detailed human mesh reconstruction has made significant +progress in recent years. However, current methods still face challenges when +used in industrial applications due to unstable results, low-quality meshes, +and a lack of UV unwrapping and skinning weights. In this paper, we present +SHERT, a novel pipeline that can reconstruct semantic human meshes with +textures and high-precision details. SHERT applies semantic- and normal-based +sampling between the detailed surface (e.g. mesh and SDF) and the corresponding +SMPL-X model to obtain a partially sampled semantic mesh and then generates the +complete semantic mesh by our specifically designed self-supervised completion +and refinement networks. Using the complete semantic mesh as a basis, we employ +a texture diffusion model to create human textures that are driven by both +images and texts. Our reconstructed meshes have stable UV unwrapping, +high-quality triangle meshes, and consistent semantic information. The given +SMPL-X model provides semantic information and shape priors, allowing SHERT to +perform well even with incorrect and incomplete inputs. The semantic +information also makes it easy to substitute and animate different body parts +such as the face, body, and hands. Quantitative and qualitative experiments +demonstrate that SHERT is capable of producing high-fidelity and robust +semantic meshes that outperform state-of-the-art methods.",cs.CV,['cs.CV'] +Infrared Small Target Detection with Scale and Location Sensitivity,Qiankun Liu · Rui Liu · Bolun Zheng · Hongkui Wang · Ying Fu, ,https://arxiv.org/abs/2403.19366,,2403.19366.pdf,Infrared Small Target Detection with Scale and Location Sensitivity,"Recently, infrared small target detection (IRSTD) has been dominated by +deep-learning-based methods. However, these methods mainly focus on the design +of complex model structures to extract discriminative features, leaving the +loss functions for IRSTD under-explored. For example, the widely used +Intersection over Union (IoU) and Dice losses lack sensitivity to the scales +and locations of targets, limiting the detection performance of detectors. In +this paper, we focus on boosting detection performance with a more effective +loss but a simpler model structure. Specifically, we first propose a novel +Scale and Location Sensitive (SLS) loss to handle the limitations of existing +losses: 1) for scale sensitivity, we compute a weight for the IoU loss based on +target scales to help the detector distinguish targets with different scales: +2) for location sensitivity, we introduce a penalty term based on the center +points of targets to help the detector localize targets more precisely. Then, +we design a simple Multi-Scale Head to the plain U-Net (MSHNet). By applying +SLS loss to each scale of the predictions, our MSHNet outperforms existing +state-of-the-art methods by a large margin. In addition, the detection +performance of existing detectors can be further improved when trained with our +SLS loss, demonstrating the effectiveness and generalization of our SLS loss. 
+The code is available at https://github.com/ying-fu/MSHNet.",cs.CV,['cs.CV'] +Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships,Sebastian Koch · Narunas Vaskevicius · Mirco Colosi · Pedro Hermosilla · Timo Ropinski,https://kochsebastian.com/open3dsg,https://arxiv.org/abs/2402.12259,,2402.12259.pdf,Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships,"Current approaches for 3D scene graph prediction rely on labeled datasets to +train models for a fixed set of known object classes and relationship +categories. We present Open3DSG, an alternative approach to learn 3D scene +graph prediction in an open world without requiring labeled scene graph data. +We co-embed the features from a 3D scene graph prediction backbone with the +feature space of powerful open world 2D vision language foundation models. This +enables us to predict 3D scene graphs from 3D point clouds in a zero-shot +manner by querying object classes from an open vocabulary and predicting the +inter-object relationships from a grounded LLM with scene graph features and +queried object classes as context. Open3DSG is the first 3D point cloud method +to predict not only explicit open-vocabulary object classes, but also open-set +relationships that are not limited to a predefined label set, making it +possible to express rare as well as specific objects and relationships in the +predicted 3D scene graph. Our experiments show that Open3DSG is effective at +predicting arbitrary object classes as well as their complex inter-object +relationships describing spatial, supportive, semantic and comparative +relationships.",cs.CV,['cs.CV'] +SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation,Bin Xie · Jiale Cao · Jin Xie · Fahad Shahbaz Khan · Yanwei Pang,https://github.com/xb534/SED,https://arxiv.org/abs/2311.15537,,2311.15537.pdf,SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation,"Open-vocabulary semantic segmentation strives to distinguish pixels into +different semantic groups from an open set of categories. Most existing methods +explore utilizing pre-trained vision-language models, in which the key is to +adopt the image-level model for pixel-level segmentation task. In this paper, +we propose a simple encoder-decoder, named SED, for open-vocabulary semantic +segmentation, which comprises a hierarchical encoder-based cost map generation +and a gradual fusion decoder with category early rejection. The hierarchical +encoder-based cost map generation employs hierarchical backbone, instead of +plain transformer, to predict pixel-level image-text cost map. Compared to +plain transformer, hierarchical backbone better captures local spatial +information and has linear computational complexity with respect to input size. +Our gradual fusion decoder employs a top-down structure to combine cost map and +the feature maps of different backbone levels for segmentation. To accelerate +inference speed, we introduce a category early rejection scheme in the decoder +that rejects many no-existing categories at the early layer of decoder, +resulting in at most 4.7 times acceleration without accuracy degradation. +Experiments are performed on multiple open-vocabulary semantic segmentation +datasets, which demonstrates the efficacy of our SED method. 
When using +ConvNeXt-B, our SED method achieves mIoU score of 31.6\% on ADE20K with 150 +categories at 82 millisecond ($ms$) per image on a single A6000. We will +release it at \url{https://github.com/xb534/SED.git}.",cs.CV,['cs.CV'] +"Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing",Boqiang Zhang · Hongtao Xie · Zuan Gao · Yuxin Wang, ,https://arxiv.org/abs/2405.12724,,2405.12724.pdf,RemoCap: Disentangled Representation Learning for Motion Capture,"Reconstructing 3D human bodies from realistic motion sequences remains a +challenge due to pervasive and complex occlusions. Current methods struggle to +capture the dynamics of occluded body parts, leading to model penetration and +distorted motion. RemoCap leverages Spatial Disentanglement (SD) and Motion +Disentanglement (MD) to overcome these limitations. SD addresses occlusion +interference between the target human body and surrounding objects. It achieves +this by disentangling target features along the dimension axis. By aligning +features based on their spatial positions in each dimension, SD isolates the +target object's response within a global window, enabling accurate capture +despite occlusions. The MD module employs a channel-wise temporal shuffling +strategy to simulate diverse scene dynamics. This process effectively +disentangles motion features, allowing RemoCap to reconstruct occluded parts +with greater fidelity. Furthermore, this paper introduces a sequence velocity +loss that promotes temporal coherence. This loss constrains inter-frame +velocity errors, ensuring the predicted motion exhibits realistic consistency. +Extensive comparisons with state-of-the-art (SOTA) methods on benchmark +datasets demonstrate RemoCap's superior performance in 3D human body +reconstruction. On the 3DPW dataset, RemoCap surpasses all competitors, +achieving the best results in MPVPE (81.9), MPJPE (72.7), and PA-MPJPE (44.1) +metrics. Codes are available at https://wanghongsheng01.github.io/RemoCap/.",cs.CV,['cs.CV'] +SpecNeRF: Gaussian Directional Encoding for Specular Reflections,Li Ma · Vasu Agrawal · Haithem Turki · Changil Kim · Chen Gao · Pedro V. Sander · Michael Zollhoefer · Christian Richardt, ,https://arxiv.org/abs/2312.13102,,2312.13102.pdf,SpecNeRF: Gaussian Directional Encoding for Specular Reflections,"Neural radiance fields have achieved remarkable performance in modeling the +appearance of 3D scenes. However, existing approaches still struggle with the +view-dependent appearance of glossy surfaces, especially under complex lighting +of indoor environments. Unlike existing methods, which typically assume distant +lighting like an environment map, we propose a learnable Gaussian directional +encoding to better model the view-dependent effects under near-field lighting +conditions. Importantly, our new directional encoding captures the +spatially-varying nature of near-field lighting and emulates the behavior of +prefiltered environment maps. As a result, it enables the efficient evaluation +of preconvolved specular color at any 3D location with varying roughness +coefficients. We further introduce a data-driven geometry prior that helps +alleviate the shape radiance ambiguity in reflection modeling. 
We show that our +Gaussian directional encoding and geometry prior significantly improve the +modeling of challenging specular reflections in neural radiance fields, which +helps decompose appearance into more physically meaningful components.",cs.CV,['cs.CV'] +Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation,Jiaming Liu · Ran Xu · Senqiao Yang · Renrui Zhang · Qizhe Zhang · Zehui Chen · Yandong Guo · Shanghang Zhang,https://sites.google.com/view/continual-mae/home,https://arxiv.org/abs/2312.12480,,2312.12480.pdf,Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation,"Continual Test-Time Adaptation (CTTA) is proposed to migrate a source +pre-trained model to continually changing target distributions, addressing +real-world dynamism. Existing CTTA methods mainly rely on entropy minimization +or teacher-student pseudo-labeling schemes for knowledge extraction in +unlabeled target domains. However, dynamic data distributions cause +miscalibrated predictions and noisy pseudo-labels in existing self-supervised +learning methods, hindering the effective mitigation of error accumulation and +catastrophic forgetting problems during the continual adaptation process. To +tackle these issues, we propose a continual self-supervised method, Adaptive +Distribution Masked Autoencoders (ADMA), which enhances the extraction of +target domain knowledge while mitigating the accumulation of distribution +shifts. Specifically, we propose a Distribution-aware Masking (DaM) mechanism +to adaptively sample masked positions, followed by establishing consistency +constraints between the masked target samples and the original target samples. +Additionally, for masked tokens, we utilize an efficient decoder to reconstruct +a hand-crafted feature descriptor (e.g., Histograms of Oriented Gradients), +leveraging its invariant properties to boost task-relevant representations. +Through conducting extensive experiments on four widely recognized benchmarks, +our proposed method attains state-of-the-art performance in both classification +and segmentation CTTA tasks. Our project page: +https://sites.google.com/view/continual-mae/home.",cs.CV,['cs.CV'] +3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow,Felix Taubner · Prashant Raina · Mathieu Tuli · Eu Wern Teh · Chul Lee · Jinmiao Huang,https://felixtaubner.github.io/flowface,https://arxiv.org/abs/2404.09819,,2404.09819.pdf,3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow,"When working with 3D facial data, improving fidelity and avoiding the uncanny +valley effect is critically dependent on accurate 3D facial performance +capture. Because such methods are expensive and due to the widespread +availability of 2D videos, recent methods have focused on how to perform +monocular 3D face tracking. However, these methods often fall short in +capturing precise facial movements due to limitations in their network +architecture, training, and evaluation processes. Addressing these challenges, +we propose a novel face tracker, FlowFace, that introduces an innovative 2D +alignment network for dense per-vertex alignment. Unlike prior work, FlowFace +is trained on high-quality 3D scan annotations rather than weak supervision or +synthetic data. 
Our 3D model fitting module jointly fits a 3D face model from +one or many observations, integrating existing neutral shape priors for +enhanced identity and expression disentanglement and per-vertex deformations +for detailed facial feature reconstruction. Additionally, we propose a novel +metric and benchmark for assessing tracking accuracy. Our method exhibits +superior performance on both custom and publicly available benchmarks. We +further validate the effectiveness of our tracker by generating high-quality 3D +data from 2D videos, which leads to performance gains on downstream tasks.",cs.CV,['cs.CV'] +Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption,Nobuhiko Wakai · Satoshi Sato · Yasunori Ishii · Takayoshi Yamashita, ,,https://paperswithcode.com/search?q=author:Yasunori+Ishii,,,,,nan +Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation,Hongwei Yan · Liyuan Wang · Kaisheng Ma · Yi Zhong, ,https://arxiv.org/abs/2404.00417,,2404.00417.pdf,Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation,"To accommodate real-world dynamics, artificial intelligence systems need to +cope with sequentially arriving content in an online manner. Beyond regular +Continual Learning (CL) attempting to address catastrophic forgetting with +offline training of each task, Online Continual Learning (OCL) is a more +challenging yet realistic setting that performs CL in a one-pass data stream. +Current OCL methods primarily rely on memory replay of old training samples. +However, a notable gap from CL to OCL stems from the additional +overfitting-underfitting dilemma associated with the use of rehearsal buffers: +the inadequate learning of new training samples (underfitting) and the repeated +learning of a few old training samples (overfitting). To this end, we introduce +a novel approach, Multi-level Online Sequential Experts (MOSE), which +cultivates the model as stacked sub-experts, integrating multi-level +supervision and reverse self-distillation. Supervision signals across multiple +stages facilitate appropriate convergence of the new task while gathering +various strengths from experts by knowledge distillation mitigates the +performance decline of old tasks. MOSE demonstrates remarkable efficacy in +learning new samples and preserving past knowledge through multi-level experts, +thereby significantly advancing OCL performance over state-of-the-art baselines +(e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains,George Eskandar, ,https://arxiv.org/abs/2402.17562v1,,2402.17562v1.pdf,An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains,"3D Object Detectors (3D-OD) are crucial for understanding the environment in +many robotic tasks, especially autonomous driving. Including 3D information via +Lidar sensors improves accuracy greatly. However, such detectors perform poorly +on domains they were not trained on, i.e. different locations, sensors, +weather, etc., limiting their reliability in safety-critical applications. +There exist methods to adapt 3D-ODs to these domains; however, these methods +treat 3D-ODs as a black box, neglecting underlying architectural decisions and +source-domain training strategies. 
Instead, we dive deep into the details of +3D-ODs, focusing our efforts on fundamental factors that influence robustness +prior to domain adaptation. + We systematically investigate four design choices (and the interplay between +them) often overlooked in 3D-OD robustness and domain adaptation: architecture, +voxel encoding, data augmentations, and anchor strategies. We assess their +impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks +encompassing three types of domain gaps - sensor type, weather, and location. + Our main findings are: (1) transformer backbones with local point features +are more robust than 3D CNNs, (2) test-time anchor size adjustment is crucial +for adaptation across geographical locations, significantly boosting scores +without retraining, (3) source-domain augmentations allow the model to +generalize to low-resolution sensors, and (4) surprisingly, robustness to bad +weather is improved when training directly on more clean weather data than on +training with bad weather data. We outline our main conclusions and findings to +provide practical guidance on developing more robust 3D-ODs.",cs.CV,['cs.CV'] +FreeKD: Knowledge Distillation via Semantic Frequency Prompt,Yuan Zhang · Tao Huang · Jiaming Liu · Tao Jiang · Kuan Cheng · Shanghang Zhang, ,https://arxiv.org/abs/2311.12079,,2311.12079.pdf,FreeKD: Knowledge Distillation via Semantic Frequency Prompt,"Knowledge distillation (KD) has been applied to various tasks successfully, +and mainstream methods typically boost the student model via spatial imitation +losses. However, the consecutive downsamplings induced in the spatial domain of +teacher model is a type of corruption, hindering the student from analyzing +what specific information needs to be imitated, which results in accuracy +degradation. To better understand the underlying pattern of corrupted feature +maps, we shift our attention to the frequency domain. During frequency +distillation, we encounter a new challenge: the low-frequency bands convey +general but minimal context, while the high are more informative but also +introduce noise. Not each pixel within the frequency bands contributes equally +to the performance. To address the above problem: (1) We propose the Frequency +Prompt plugged into the teacher model, absorbing the semantic frequency context +during finetuning. (2) During the distillation period, a pixel-wise frequency +mask is generated via Frequency Prompt, to localize those pixel of interests +(PoIs) in various frequency bands. Additionally, we employ a position-aware +relational frequency loss for dense prediction tasks, delivering a high-order +spatial enhancement to the student model. We dub our Frequency Knowledge +Distillation method as FreeKD, which determines the optimal localization and +extent for the frequency distillation. Extensive experiments demonstrate that +FreeKD not only outperforms spatial-based distillation methods consistently on +dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on +COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys +more robustness to the student. 
Notably, we also validate the generalization of +our approach on large-scale vision models (e.g., DINO and SAM).",cs.CV,['cs.CV'] +Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation,Jihyun Kim · Changjae Oh · Hoseok Do · Soohyun Kim · Kwanghoon Sohn, ,https://arxiv.org/abs/2405.04356,,2405.04356.pdf,Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation,"We present a new multi-modal face image generation method that converts a +text prompt and a visual input, such as a semantic mask or scribble map, into a +photo-realistic face image. To do this, we combine the strengths of Generative +Adversarial networks (GANs) and diffusion models (DMs) by employing the +multi-modal features in the DM into the latent space of the pre-trained GANs. +We present a simple mapping and a style modulation network to link two models +and convert meaningful representations in feature maps and attention maps into +latent codes. With GAN inversion, the estimated latent codes can be used to +generate 2D or 3D-aware facial images. We further present a multi-step training +strategy that reflects textual and structural representations into the +generated image. Our proposed network produces realistic 2D, multi-view, and +stylized face images, which align well with inputs. We validate our method by +using pre-trained 2D and 3D GANs, and our results outperform existing methods. +Our project page is available at +https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.",cs.CV,['cs.CV'] +Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion,Naishan Zheng · Man Zhou · Jie Huang · Junming Hou · Haoying Li · Yuan Xu · Feng Zhao, ,,https://ieeexplore.ieee.org/document/10539339,,,,,nan +Bridging the Gap Between End-to-End and Two-Step Text Spotting,Mingxin Huang · Hongliang Li · Yuliang Liu · Xiang Bai · Lianwen Jin, ,https://arxiv.org/abs/2404.04624,,2404.04624.pdf,Bridging the Gap Between End-to-End and Two-Step Text Spotting,"Modularity plays a crucial role in the development and maintenance of complex +systems. While end-to-end text spotting efficiently mitigates the issues of +error accumulation and sub-optimal performance seen in traditional two-step +methodologies, the two-step methods continue to be favored in many competitions +and practical settings due to their superior modularity. In this paper, we +introduce Bridging Text Spotting, a novel approach that resolves the error +accumulation and suboptimal performance issues in two-step methods while +retaining modularity. To achieve this, we adopt a well-trained detector and +recognizer that are developed and trained independently and then lock their +parameters to preserve their already acquired capabilities. Subsequently, we +introduce a Bridge that connects the locked detector and recognizer through a +zero-initialized neural network. This zero-initialized neural network, +initialized with weights set to zeros, ensures seamless integration of the +large receptive field features in detection into the locked recognizer. +Furthermore, since the fixed detector and recognizer cannot naturally acquire +end-to-end optimization features, we adopt the Adapter to facilitate their +efficient learning of these features. We demonstrate the effectiveness of the +proposed method through extensive experiments: Connecting the latest detector +and recognizer through Bridging Text Spotting, we achieved an accuracy of 83.3% +on Total-Text, 69.8% on CTW1500, and 89.5% on ICDAR 2015. 
The code is available +at https://github.com/mxin262/Bridging-Text-Spotting.",cs.CV,['cs.CV'] +OED: Towards One-stage End-to-End Dynamic Scene Graph Generation,Guan Wang · Zhimin Li · Qingchao Chen · Yang Liu, ,https://arxiv.org/abs/2405.16925,,2405.16925.pdf,OED: Towards One-stage End-to-End Dynamic Scene Graph Generation,"Dynamic Scene Graph Generation (DSGG) focuses on identifying visual +relationships within the spatial-temporal domain of videos. Conventional +approaches often employ multi-stage pipelines, which typically consist of +object detection, temporal association, and multi-relation classification. +However, these methods exhibit inherent limitations due to the separation of +multiple stages, and independent optimization of these sub-problems may yield +sub-optimal solutions. To remedy these limitations, we propose a one-stage +end-to-end framework, termed OED, which streamlines the DSGG pipeline. This +framework reformulates the task as a set prediction problem and leverages +pair-wise features to represent each subject-object pair within the scene +graph. Moreover, another challenge of DSGG is capturing temporal dependencies, +we introduce a Progressively Refined Module (PRM) for aggregating temporal +context without the constraints of additional trackers or handcrafted +trajectories, enabling end-to-end optimization of the network. Extensive +experiments conducted on the Action Genome benchmark demonstrate the +effectiveness of our design. The code and models are available at +\url{https://github.com/guanw-pku/OED}.",cs.CV,['cs.CV'] +Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding,Zhiheng Cheng · Qingyue Wei · Hongru Zhu · Yan Wang · Liangqiong Qu · Wei Shao · Yuyin Zhou, ,https://arxiv.org/abs/2403.18271,,2403.18271.pdf,Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding,"The Segment Anything Model (SAM) has garnered significant attention for its +versatile segmentation abilities and intuitive prompt-based interface. However, +its application in medical imaging presents challenges, requiring either +substantial training costs and extensive medical datasets for full model +fine-tuning or high-quality prompts for optimal performance. This paper +introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient +fine-tuning of medical images via a two-stage hierarchical decoding procedure. +In the initial stage, H-SAM employs SAM's original decoder to generate a prior +probabilistic mask, guiding a more intricate decoding process in the second +stage. Specifically, we propose two key designs: 1) A class-balanced, +mask-guided self-attention mechanism addressing the unbalanced label +distribution, enhancing image embedding; 2) A learnable mask cross-attention +mechanism spatially modulating the interplay among different image regions +based on the prior mask. Moreover, the inclusion of a hierarchical pixel +decoder in H-SAM enhances its proficiency in capturing fine-grained and +localized details. This approach enables SAM to effectively integrate learned +medical priors, facilitating enhanced adaptation for medical image segmentation +with limited samples. Our H-SAM demonstrates a 4.78% improvement in average +Dice compared to existing prompt-free SAM variants for multi-organ segmentation +using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM +even outperforms state-of-the-art semi-supervised models relying on extensive +unlabeled training data across various medical datasets. 
Our code is available +at https://github.com/Cccccczh404/H-SAM.",cs.CV,['cs.CV'] +Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers,Jinxia Xie · Bineng Zhong · Zhiyi Mo · Shengping Zhang · Liangtao Shi · Shuxiang Song · Rongrong Ji, ,https://arxiv.org/abs/2403.10574,,2403.10574.pdf,Autoregressive Queries for Adaptive Tracking with Spatio-TemporalTransformers,"The rich spatio-temporal information is crucial to capture the complicated +target appearance variations in visual tracking. However, most top-performing +tracking algorithms rely on many hand-crafted components for spatio-temporal +information aggregation. Consequently, the spatio-temporal information is far +away from being fully explored. To alleviate this issue, we propose an adaptive +tracker with spatio-temporal transformers (named AQATrack), which adopts simple +autoregressive queries to effectively learn spatio-temporal information without +many hand-designed components. Firstly, we introduce a set of learnable and +autoregressive queries to capture the instantaneous target appearance changes +in a sliding window fashion. Then, we design a novel attention mechanism for +the interaction of existing queries to generate a new query in current frame. +Finally, based on the initial target template and learnt autoregressive +queries, a spatio-temporal information fusion module (STM) is designed for +spatiotemporal formation aggregation to locate a target object. Benefiting from +the STM, we can effectively combine the static appearance and instantaneous +changes to guide robust tracking. Extensive experiments show that our method +significantly improves the tracker's performance on six popular tracking +benchmarks: LaSOT, LaSOText, TrackingNet, GOT-10k, TNL2K, and UAV123.",cs.CV,['cs.CV'] +GRAM: Global Reasoning for Multi-Page VQA,Itshak Blau · Sharon Fogel · Roi Ronen · Alona Golts · Shahar Tsiper · Elad Ben Avraham · Aviad Aberdam · Roy Ganz · Ron Litman, ,https://arxiv.org/abs/2401.03411,,2401.03411.pdf,GRAM: Global Reasoning for Multi-Page VQA,"The increasing use of transformer-based large language models brings forward +the challenge of processing long sequences. In document visual question +answering (DocVQA), leading methods focus on the single-page setting, while +documents can span hundreds of pages. We present GRAM, a method that seamlessly +extends pre-trained single-page models to the multi-page setting, without +requiring computationally-heavy pretraining. To do so, we leverage a +single-page encoder for local page-level understanding, and enhance it with +document-level designated layers and learnable tokens, facilitating the flow of +information across pages for global reasoning. To enforce our model to utilize +the newly introduced document tokens, we propose a tailored bias adaptation +method. For additional computational savings during decoding, we introduce an +optional compression stage using our compression-transformer +(C-Former),reducing the encoded sequence length, thereby allowing a tradeoff +between quality and latency. 
Extensive experiments showcase GRAM's +state-of-the-art performance on the benchmarks for multi-page DocVQA, +demonstrating the effectiveness of our approach.",cs.CL,"['cs.CL', 'cs.CV']" +Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability,Yan Huang · Zhang Zhang · Qiang Wu · yi zhong · Liang Wang, ,,https://www.youtube.com/watch?v=5xrCUp_gdwg,,,,,nan +AnyScene: Customized Image Synthesis with Composited Foreground,Ruidong Chen · Lanjun Wang · Weizhi Nie · Yongdong Zhang · An-An Liu, ,https://ar5iv.labs.arxiv.org/html/2302.09778,,2302.09778.pdf,Composer: Creative and Controllable Image Synthesis with Composable Conditions,"Recent large-scale generative models learned on big data are capable of +synthesizing incredible images yet suffer from limited controllability. This +work offers a new generation paradigm that allows flexible control of the +output image, such as spatial layout and palette, while maintaining the +synthesis quality and model creativity. With compositionality as the core idea, +we first decompose an image into representative factors, and then train a +diffusion model with all these factors as the conditions to recompose the +input. At the inference stage, the rich intermediate representations work as +composable elements, leading to a huge design space (i.e., exponentially +proportional to the number of decomposed factors) for customizable content +creation. It is noteworthy that our approach, which we call Composer, supports +various levels of conditions, such as text description as the global +information, depth map and sketch as the local guidance, color histogram for +low-level details, etc. Besides improving controllability, we confirm that +Composer serves as a general framework and facilitates a wide range of +classical generative tasks without retraining. Code and models will be made +available.",cs.CV,"['cs.CV', 'cs.GR']" +Multiway Point Cloud Mosaicking with Diffusion and Global Optimization,Shengze Jin · Iro Armeni · Marc Pollefeys · Daniel Barath, ,https://arxiv.org/abs/2404.00429,,2404.00429.pdf,Multiway Point Cloud Mosaicking with Diffusion and Global Optimization,"We introduce a novel framework for multiway point cloud mosaicking (named +Wednesday), designed to co-align sets of partially overlapping point clouds -- +typically obtained from 3D scanners or moving RGB-D cameras -- into a unified +coordinate system. At the core of our approach is ODIN, a learned pairwise +registration algorithm that iteratively identifies overlaps and refines +attention scores, employing a diffusion-based process for denoising pairwise +correlation matrices to enhance matching accuracy. Further steps include +constructing a pose graph from all point clouds, performing rotation averaging, +a novel robust algorithm for re-estimating translations optimally in terms of +consensus maximization and translation optimization. Finally, the point cloud +rotations and positions are optimized jointly by a diffusion-based approach. +Tested on four diverse, large-scale datasets, our method achieves +state-of-the-art pairwise and multiway registration results by a large margin +on all benchmarks. 
Our code and models are available at +https://github.com/jinsz/Multiway-Point-Cloud-Mosaicking-with-Diffusion-and-Global-Optimization.",cs.CV,['cs.CV'] +Dexterous Grasp Transformer,Guo-Hao Xu · Yi-Lin Wei · Dian Zheng · Xiao-Ming Wu · Wei-Shi Zheng, ,https://arxiv.org/abs/2404.18135,,2404.18135.pdf,Dexterous Grasp Transformer,"In this work, we propose a novel discriminative framework for dexterous grasp +generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a +diverse set of feasible grasp poses by processing the object point cloud with +only one forward pass. We formulate dexterous grasp generation as a set +prediction task and design a transformer-based grasping model for it. However, +we identify that this set prediction paradigm encounters several optimization +challenges in the field of dexterous grasping and results in restricted +performance. To address these issues, we propose progressive strategies for +both the training and testing phases. First, the dynamic-static matching +training (DSMT) strategy is presented to enhance the optimization stability +during the training phase. Second, we introduce the adversarial-balanced +test-time adaptation (AB-TTA) with a pair of adversarial losses to improve +grasping quality during the testing phase. Experimental results on the +DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous +grasp poses with both high quality and diversity. Notably, while keeping high +quality, the diversity of grasp poses predicted by DGTR significantly +outperforms previous works in multiple metrics without any data pre-processing. +Codes are available at https://github.com/iSEE-Laboratory/DGTR .",cs.RO,['cs.RO'] +MoCha-Stereo: Motif Channel Attention Network for Stereo Matching,Ziyang Chen · Wei Long · He Yao · Yongjun Zhang · Bingshu Wang · Yongbin Qin · Jia Wu,https://github.com/ZYangChen/MoCha-Stereo,https://arxiv.org/abs/2404.06842,,2404.06842.pdf,MoCha-Stereo: Motif Channel Attention Network for Stereo Matching,"Learning-based stereo matching techniques have made significant progress. +However, existing methods inevitably lose geometrical structure information +during the feature channel generation process, resulting in edge detail +mismatches. In this paper, the Motif Channel Attention Stereo Matching Network +(MoCha-Stereo) is designed to address this problem. We provide the Motif +Channel Correlation Volume (MCCV) to determine more accurate edge matching +costs. MCCV is achieved by projecting motif channels, which capture common +geometric structures in feature channels, onto feature maps and cost volumes. +In addition, edge variations in potential feature channels of the +reconstruction error map also affect details matching, we propose the +Reconstruction Error Motif Penalty (REMP) module to further refine the +full-resolution disparity estimation. REMP integrates the frequency information +of typical channel features from the reconstruction error. MoCha-Stereo ranks +1st on the KITTI-2015 and KITTI-2012 Reflective leaderboards. Our structure +also shows excellent performance in Multi-View Stereo. 
Code is avaliable at +https://github.com/ZYangChen/MoCha-Stereo.",cs.CV,['cs.CV'] +Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis,Zhan Li · Zhang Chen · Zhong Li · Yi Xu,https://oppo-us-research.github.io/SpacetimeGaussians-website/,https://arxiv.org/abs/2312.16812,,2312.16812.pdf,Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis,"Novel view synthesis of dynamic scenes has been an intriguing yet challenging +problem. Despite recent advancements, simultaneously achieving high-resolution +photorealistic results, real-time rendering, and compact storage remains a +formidable task. To address these challenges, we propose Spacetime Gaussian +Feature Splatting as a novel dynamic scene representation, composed of three +pivotal components. First, we formulate expressive Spacetime Gaussians by +enhancing 3D Gaussians with temporal opacity and parametric motion/rotation. +This enables Spacetime Gaussians to capture static, dynamic, as well as +transient content within a scene. Second, we introduce splatted feature +rendering, which replaces spherical harmonics with neural features. These +features facilitate the modeling of view- and time-dependent appearance while +maintaining small size. Third, we leverage the guidance of training error and +coarse depth to sample new Gaussians in areas that are challenging to converge +with existing pipelines. Experiments on several established real-world datasets +demonstrate that our method achieves state-of-the-art rendering quality and +speed, while retaining compact storage. At 8K resolution, our lite-version +model can render at 60 FPS on an Nvidia RTX 4090 GPU. Our code is available at +https://github.com/oppo-us-research/SpacetimeGaussians.",cs.CV,"['cs.CV', 'cs.GR']" +MoReVQA: Exploring Modular Reasoning Models for Video Question Answering,Juhong Min · Shyamal Buch · Arsha Nagrani · Minsu Cho · Cordelia Schmid, ,https://arxiv.org/abs/2404.06511,,2404.06511.pdf,MoReVQA: Exploring Modular Reasoning Models for Video Question Answering,"This paper addresses the task of video question answering (videoQA) via a +decomposed multi-stage, modular reasoning framework. Previous modular methods +have shown promise with a single planning stage ungrounded in visual content. +However, through a simple and effective baseline, we find that such systems can +lead to brittle behavior in practice for challenging videoQA settings. Thus, +unlike traditional single-stage planning methods, we propose a multi-stage +system consisting of an event parser, a grounding stage, and a final reasoning +stage in conjunction with an external memory. All stages are training-free, and +performed using few-shot prompting of large models, creating interpretable +intermediate outputs at each stage. By decomposing the underlying planning and +task complexity, our method, MoReVQA, improves over prior work on standard +videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with +state-of-the-art results, and extensions to related tasks (grounded videoQA, +paragraph captioning).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation,Yifei Li · Hsiaoyu Chen · Egor Larionov · Nikolaos Sarafianos · Wojciech Matusik · Tuur Stuyck, ,https://arxiv.org/abs/2311.12194,,2311.12194.pdf,DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation,"The realism of digital avatars is crucial in enabling telepresence +applications with self-expression and customization. 
While physical simulations +can produce realistic motions for clothed humans, they require high-quality +garment assets with associated physical parameters for cloth simulations. +However, manually creating these assets and calibrating their parameters is +labor-intensive and requires specialized expertise. Current methods focus on +reconstructing geometry, but don't generate complete assets for physics-based +applications. To address this gap, we propose DiffAvatar, a novel approach that +performs body and garment co-optimization using differentiable simulation. By +integrating physical simulation into the optimization loop and accounting for +the complex nonlinear behavior of cloth and its intricate interaction with the +body, our framework recovers body and garment geometry and extracts important +material parameters in a physically plausible way. Our experiments demonstrate +that our approach generates realistic clothing and body shape suitable for +downstream applications. We provide additional insights and results on our +webpage: https://people.csail.mit.edu/liyifei/publication/diffavatar/",cs.CV,['cs.CV'] +SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers,Ioannis Kakogeorgiou · Spyros Gidaris · Konstantinos Karantzalos · Nikos Komodakis, ,https://arxiv.org/abs/2312.00648,,2312.00648.pdf,SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers,"Unsupervised object-centric learning aims to decompose scenes into +interpretable object entities, termed slots. Slot-based auto-encoders stand out +as a prominent method for this task. Within them, crucial aspects include +guiding the encoder to generate object-specific slots and ensuring the decoder +utilizes them during reconstruction. This work introduces two novel techniques, +(i) an attention-based self-training approach, which distills superior +slot-based attention masks from the decoder to the encoder, enhancing object +segmentation, and (ii) an innovative patch-order permutation strategy for +autoregressive transformers that strengthens the role of slot vectors in +reconstruction. The effectiveness of these strategies is showcased +experimentally. The combined approach significantly surpasses prior slot-based +autoencoder methods in unsupervised object segmentation, especially with +complex real-world images. We provide the implementation code at +https://github.com/gkakogeorgiou/spot .",cs.CV,['cs.CV'] +Color Shift Estimation-and-Correction for Image Enhancement,Yiyu Li · Ke Xu · Gerhard Hancke · Rynson W.H. Lau, ,https://arxiv.org/abs/2405.17725,,2405.17725.pdf,Color Shift Estimation-and-Correction for Image Enhancement,"Images captured under sub-optimal illumination conditions may contain both +over- and under-exposures. Current approaches mainly focus on adjusting image +brightness, which may exacerbate the color tone distortion in under-exposed +areas and fail to restore accurate colors in over-exposed regions. We observe +that over- and under-exposed regions display opposite color tone distribution +shifts with respect to each other, which may not be easily normalized in joint +modeling as they usually do not have ``normal-exposed'' regions/pixels as +reference. In this paper, we propose a novel method to enhance images with both +over- and under-exposures by learning to estimate and correct such color +shifts. 
Specifically, we first derive the color feature maps of the brightened +and darkened versions of the input image via a UNet-based network, followed by +a pseudo-normal feature generator to produce pseudo-normal color feature maps. +We then propose a novel COlor Shift Estimation (COSE) module to estimate the +color shifts between the derived brightened (or darkened) color feature maps +and the pseudo-normal color feature maps. The COSE module corrects the +estimated color shifts of the over- and under-exposed regions separately. We +further propose a novel COlor MOdulation (COMO) module to modulate the +separately corrected colors in the over- and under-exposed regions to produce +the enhanced image. Comprehensive experiments show that our method outperforms +existing approaches. Project webpage: https://github.com/yiyulics/CSEC.",cs.CV,['cs.CV'] +Human Gaussian Splatting : Real-time Rendering of Animatable Avatars,Arthur Moreau · Jifei Song · Helisa Dhamo · Richard Shaw · Yiren Zhou · Eduardo Pérez-Pellitero,https://perezpellitero.github.io/projects/hugs/index.html,https://arxiv.org/abs/2311.17113,,2311.17113.pdf,Human Gaussian Splatting: Real-time Rendering of Animatable Avatars,"This work addresses the problem of real-time rendering of photorealistic +human body avatars learned from multi-view videos. While the classical +approaches to model and render virtual humans generally use a textured mesh, +recent research has developed neural body representations that achieve +impressive visual quality. However, these models are difficult to render in +real-time and their quality degrades when the character is animated with body +poses different than the training observations. We propose an animatable human +model based on 3D Gaussian Splatting, that has recently emerged as a very +efficient alternative to neural radiance fields. The body is represented by a +set of gaussian primitives in a canonical space which is deformed with a coarse +to fine approach that combines forward skinning and local non-rigid refinement. +We describe how to learn our Human Gaussian Splatting (HuGS) model in an +end-to-end fashion from multi-view observations, and evaluate it against the +state-of-the-art approaches for novel pose synthesis of clothed body. Our +method achieves 1.5 dB PSNR improvement over the state-of-the-art on THuman4 +dataset while being able to render in real-time (80 fps for 512x512 +resolution).",cs.CV,"['cs.CV', 'cs.GR']" +Boosting Spike Camera Image Reconstruction from a Perspective of Dealing with Spike Fluctuations,Rui Zhao · Ruiqin Xiong · Jing Zhao · Jian Zhang · Xiaopeng Fan · Zhaofei Yu · Tiejun Huang, ,https://ar5iv.labs.arxiv.org/html/2303.11684,,2303.11684.pdf,SpikeCV: Open a Continuous Computer Vision Era,"SpikeCV is a new open-source computer vision platform for the spike camera, +which is a neuromorphic visual sensor that has developed rapidly in recent +years. In the spike camera, each pixel position directly accumulates the light +intensity and asynchronously fires spikes. The output binary spikes can reach a +frequency of 40,000 Hz. As a new type of visual expression, spike sequence has +high spatiotemporal completeness and preserves the continuous visual +information of the external world. Taking advantage of the low latency and high +dynamic range of the spike camera, many spike-based algorithms have made +significant progress, such as high-quality imaging and ultra-high-speed target +detection. 
+ To build up a community ecology for the spike vision to facilitate more users +to take advantage of the spike camera, SpikeCV provides a variety of +ultra-high-speed scene datasets, hardware interfaces, and an easy-to-use +modules library. SpikeCV focuses on encapsulation for spike data, +standardization for dataset interfaces, modularization for vision tasks, and +real-time applications for challenging scenes. With the advent of the +open-source Python ecosystem, modules of SpikeCV can be used as a Python +library to fulfilled most of the numerical analysis needs of researchers. We +demonstrate the efficiency of the SpikeCV on offline inference and real-time +applications. The project repository address are +\url{https://openi.pcl.ac.cn/Cordium/SpikeCV} and +\url{https://github.com/Zyj061/SpikeCV",cs.CV,['cs.CV'] +DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans,Akash Sengupta · Thiemo Alldieck · NIKOS KOLOTOUROS · Enric Corona · Andrei Zanfir · Cristian Sminchisescu,https://akashsengupta1997.github.io/diffhuman/,https://arxiv.org/abs/2404.00485,,2404.00485.pdf,DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans,"We present DiffHuman, a probabilistic method for photorealistic 3D human +reconstruction from a single RGB image. Despite the ill-posed nature of this +problem, most methods are deterministic and output a single solution, often +resulting in a lack of geometric detail and blurriness in unseen or uncertain +regions. In contrast, DiffHuman predicts a probability distribution over 3D +reconstructions conditioned on an input 2D image, which allows us to sample +multiple detailed 3D avatars that are consistent with the image. DiffHuman is +implemented as a conditional diffusion model that denoises pixel-aligned 2D +observations of an underlying 3D shape representation. During inference, we may +sample 3D avatars by iteratively denoising 2D renders of the predicted 3D +representation. Furthermore, we introduce a generator neural network that +approximates rendering with considerably reduced runtime (55x speed up), +resulting in a novel dual-branch diffusion framework. Our experiments show that +DiffHuman can produce diverse and detailed reconstructions for the parts of the +person that are unseen or uncertain in the input image, while remaining +competitive with the state-of-the-art when reconstructing visible surfaces.",cs.CV,['cs.CV'] +LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation,Ke Guo · Zhenwei Miao · Wei Jing · Weiwei Liu · Weizi Li · Dayang Hao · Jia Pan,https://sites.google.com/view/lasil,https://arxiv.org/abs/2403.17601,,2403.17601.pdf,LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation,"Microscopic traffic simulation plays a crucial role in transportation +engineering by providing insights into individual vehicle behavior and overall +traffic flow. However, creating a realistic simulator that accurately +replicates human driving behaviors in various traffic conditions presents +significant challenges. Traditional simulators relying on heuristic models +often fail to deliver accurate simulations due to the complexity of real-world +traffic environments. Due to the covariate shift issue, existing imitation +learning-based simulators often fail to generate stable long-term simulations. 
+In this paper, we propose a novel approach called learner-aware supervised +imitation learning to address the covariate shift problem in multi-agent +imitation learning. By leveraging a variational autoencoder simultaneously +modeling the expert and learner state distribution, our approach augments +expert states such that the augmented state is aware of learner state +distribution. Our method, applied to urban traffic simulation, demonstrates +significant improvements over existing state-of-the-art baselines in both +short-term microscopic and long-term macroscopic realism when evaluated on the +real-world dataset pNEUMA.",cs.AI,"['cs.AI', 'cs.LG']" +Sparse Semi-Detr: Sparse Learnable Queries for Semi-Supervised Object Detection,Tahira Shehzadi · Khurram Azeem Hashmi · Didier Stricker · Muhammad Zeshan Afzal, ,https://arxiv.org/abs/2404.01819,,2404.01819.pdf,Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection,"In this paper, we address the limitations of the DETR-based semi-supervised +object detection (SSOD) framework, particularly focusing on the challenges +posed by the quality of object queries. In DETR-based SSOD, the one-to-one +assignment strategy provides inaccurate pseudo-labels, while the one-to-many +assignments strategy leads to overlapping predictions. These issues compromise +training efficiency and degrade model performance, especially in detecting +small or occluded objects. We introduce Sparse Semi-DETR, a novel +transformer-based, end-to-end semi-supervised object detection solution to +overcome these challenges. Sparse Semi-DETR incorporates a Query Refinement +Module to enhance the quality of object queries, significantly improving +detection capabilities for small and partially obscured objects. Additionally, +we integrate a Reliable Pseudo-Label Filtering Module that selectively filters +high-quality pseudo-labels, thereby enhancing detection accuracy and +consistency. On the MS-COCO and Pascal VOC object detection benchmarks, Sparse +Semi-DETR achieves a significant improvement over current state-of-the-art +methods that highlight Sparse Semi-DETR's effectiveness in semi-supervised +object detection, particularly in challenging scenarios involving small or +partially obscured objects.",cs.CV,['cs.CV'] +CityDreamer: Compositional Generative Model of Unbounded 3D Cities,Haozhe Xie · Zhaoxi Chen · Fangzhou Hong · Ziwei Liu,https://www.infinitescript.com/project/city-dreamer,https://arxiv.org/abs/2309.00610,,2309.00610.pdf,CityDreamer: Compositional Generative Model of Unbounded 3D Cities,"3D city generation is a desirable yet challenging task, since humans are more +sensitive to structural distortions in urban environments. Additionally, +generating 3D cities is more complex than 3D natural scenes since buildings, as +objects of the same class, exhibit a wider range of appearances compared to the +relatively consistent appearance of objects like trees in natural scenes. To +address these challenges, we propose \textbf{CityDreamer}, a compositional +generative model designed specifically for unbounded 3D cities. Our key insight +is that 3D city generation should be a composition of different types of neural +fields: 1) various building instances, and 2) background stuff, such as roads +and green lands. Specifically, we adopt the bird's eye view scene +representation and employ a volumetric render for both instance-oriented and +stuff-oriented neural fields. 
The generative hash grid and periodic positional +embedding are tailored as scene parameterization to suit the distinct +characteristics of building instances and background stuff. Furthermore, we +contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which +comprises a vast amount of real-world city imagery to enhance the realism of +the generated 3D cities both in their layouts and appearances. CityDreamer +achieves state-of-the-art performance not only in generating realistic 3D +cities but also in localized editing within the generated cities.",cs.CV,['cs.CV'] +"One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications",Mengyao Lyu · Yuhong Yang · Haiwen Hong · Hui Chen · Xuan Jin · Yuan He · Hui Xue · Jungong Han · Guiguang Ding,https://lyumengyao.github.io/projects/spm,https://arxiv.org/abs/2312.16145,,2312.16145.pdf,"One-Dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications","The prevalent use of commercial and open-source diffusion models (DMs) for +text-to-image generation prompts risk mitigation to prevent undesired +behaviors. Existing concept erasing methods in academia are all based on full +parameter or specification-based fine-tuning, from which we observe the +following issues: 1) Generation alternation towards erosion: Parameter drift +during target elimination causes alternations and potential deformations across +all generations, even eroding other concepts at varying degrees, which is more +evident with multi-concept erased; 2) Transfer inability & deployment +inefficiency: Previous model-specific erasure impedes the flexible combination +of concepts and the training-free transfer towards other models, resulting in +linear cost growth as the deployment scenarios increase. To achieve +non-invasive, precise, customizable, and transferable elimination, we ground +our erasing framework on one-dimensional adapters to erase multiple concepts +from most DMs at once across versatile erasing applications. The +concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to +learn targeted erasing, and meantime the alteration and erosion phenomenon is +effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once +obtained, SPMs can be flexibly combined and plug-and-play for other DMs without +specific re-tuning, enabling timely and efficient adaptation to diverse +scenarios. During generation, our Facilitated Transport mechanism dynamically +regulates the permeability of each SPM to respond to different input prompts, +further minimizing the impact on other concepts. Quantitative and qualitative +results across ~40 concepts, 7 DMs and 4 erasing applications have demonstrated +the superior erasing of SPM. Our code and pre-tuned SPMs are available on the +project page https://lyumengyao.github.io/projects/spm.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention,Ju-Hyeon Nam · Nur Suriza Syazwany · Su Jung Kim · Sang-Chul Lee,https://skawngus1111.github.io/MADGNet_project/,https://arxiv.org/abs/2405.06284,,2405.06284.pdf,Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention,"Generalizability in deep neural networks plays a pivotal role in medical +image segmentation. 
However, deep learning-based medical image analyses tend to +overlook the importance of frequency variance, which is critical element for +achieving a model that is both modality-agnostic and domain-generalizable. +Additionally, various models fail to account for the potential information loss +that can arise from multi-task learning under deep supervision, a factor that +can impair the model representation ability. To address these challenges, we +propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical +image segmentation, which comprises two key components: a Multi-Frequency in +Multi-Scale Attention (MFMSA) block and Ensemble Sub-Decoding Module (E-SDM). +The MFMSA block refines the process of spatial feature extraction, particularly +in capturing boundary features, by incorporating multi-frequency and +multi-scale features, thereby offering informative cues for tissue outline and +anatomical structures. Moreover, we propose E-SDM to mitigate information loss +in multi-task learning with deep supervision, especially during substantial +upsampling from low resolution. We evaluate the segmentation performance of +MADGNet across six modalities and fifteen datasets. Through extensive +experiments, we demonstrate that MADGNet consistently outperforms +state-of-the-art models across various modalities, showcasing superior +segmentation performance. This affirms MADGNet as a robust solution for medical +image segmentation that excels in diverse imaging scenarios. Our MADGNet code +is available in GitHub Link.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering,Shaowei Wang · Lingling Zhang · Longji Zhu · Tao Qin · Kim-Hui Yap · Xinyu Zhang · Jun Liu, ,https://arxiv.org/abs/2312.17269,,2312.17269.pdf,Conversational Question Answering with Reformulations over Knowledge Graph,"Conversational question answering (convQA) over knowledge graphs (KGs) +involves answering multi-turn natural language questions about information +contained in a KG. State-of-the-art methods of ConvQA often struggle with +inexplicit question-answer pairs. These inputs are easy for human beings to +understand given a conversation history, but hard for a machine to interpret, +which can degrade ConvQA performance. To address this problem, we propose a +reinforcement learning (RL) based model, CornNet, which utilizes question +reformulations generated by large language models (LLMs) to improve ConvQA +performance. CornNet adopts a teacher-student architecture where a teacher +model learns question representations using human writing reformulations, and a +student model to mimic the teacher model's output via reformulations generated +by LLMs. The learned question representation is then used by an RL model to +locate the correct answer in a KG. Extensive experimental results show that +CornNet outperforms state-of-the-art convQA models.",cs.CL,"['cs.CL', 'cs.AI']" +Lane2Seq: Towards Unified Lane Detection via Sequence Generation,Kunyang Zhou,https://zkyseu.github.io/lane2seq.github.io/,https://arxiv.org/abs/2402.17172,,2402.17172.pdf,Lane2Seq: Towards Unified Lane Detection via Sequence Generation,"In this paper, we present a novel sequence generation-based framework for +lane detection, called Lane2Seq. It unifies various lane detection formats by +casting lane detection as a sequence generation task. 
This is different from +previous lane detection methods, which depend on well-designed task-specific +head networks and corresponding loss functions. Lane2Seq only adopts a plain +transformer-based encoder-decoder architecture with a simple cross-entropy +loss. Additionally, we propose a new multi-format model tuning based on +reinforcement learning to incorporate the task-specific knowledge into +Lane2Seq. Experimental results demonstrate that such a simple sequence +generation paradigm not only unifies lane detection but also achieves +competitive performance on benchmarks. For example, Lane2Seq gets 97.95\% and +97.42\% F1 score on Tusimple and LLAMAS datasets, establishing a new +state-of-the-art result for two benchmarks.",cs.CV,['cs.CV'] +Scaling Up Dynamic 3D Human-Scene Interaction Modelling,Nan Jiang · Zhiyuan Zhang · Hongjie Li · Xiaoxuan Ma · Zan Wang · Yixin Chen · Tengyu Liu · Yixin Zhu · Siyuan Huang, ,https://arxiv.org/abs/2403.08629,,2403.08629.pdf,Scaling Up Dynamic Human-Scene Interaction Modeling,"Confronting the challenges of data scarcity and advanced motion synthesis in +human-scene interaction modeling, we introduce the TRUMANS dataset alongside a +novel HSI motion synthesis method. TRUMANS stands as the most comprehensive +motion-captured HSI dataset currently available, encompassing over 15 hours of +human interactions across 100 indoor scenes. It intricately captures whole-body +human motions and part-level object dynamics, focusing on the realism of +contact. This dataset is further scaled up by transforming physical +environments into exact virtual models and applying extensive augmentations to +appearance and motion for both humans and objects while maintaining interaction +fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model +that efficiently generates HSI sequences of any length, taking into account +both scene context and intended actions. In experiments, our approach shows +remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., +PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic +original motion-captured sequences, as confirmed by quantitative experiments +and human studies.",cs.CV,['cs.CV'] +QUADify: Extracting Meshes with Pixel-level Details and Materials from Images,Maximilian Frühauf · Hayko Riemenschneider · Markus Gross · Christopher Schroers,https://maxfruehauf.com/publications/fruehauf2024quadify/drs_project_page/,,https://www.youtube.com/watch?v=n8M9c9yKGMk,,,,,nan +BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning,Ruyang Liu · Chen Li · Yixiao Ge · Thomas H. Li · Ying Shan · Ge Li, ,http://export.arxiv.org/abs/2309.15785,,2309.15785.pdf,One For All: Video Conversation is Feasible Without Video Instruction Tuning,"The recent progress in Large Language Models (LLM) has spurred various +advancements in image-language conversation agents, while how to build a +proficient video-based dialogue system is still under exploration. Considering +the extensive scale of LLM and visual backbone, minimal GPU memory is left for +facilitating effective temporal modeling, which is crucial for comprehending +and providing feedback on videos. To this end, we propose Branching Temporal +Adapter (BT-Adapter), a novel method for extending image-language pretrained +models into the video domain. Specifically, BT-Adapter serves as a plug-and-use +temporal modeling branch alongside the pretrained visual encoder, which is +tuned while keeping the backbone frozen. 
Just pretrained once, BT-Adapter can +be seamlessly integrated into all image conversation models using this version +of CLIP, enabling video conversations without the need for video instructions. +Besides, we develop a unique asymmetric token masking strategy inside the +branch with tailor-made training tasks for BT-Adapter, facilitating faster +convergence and better results. Thanks to BT-Adapter, we are able to empower +existing multimodal dialogue models with strong video understanding +capabilities without incurring excessive GPU costs. Without bells and whistles, +BT-Adapter achieves (1) state-of-the-art zero-shot results on various video +tasks using thousands of fewer GPU hours. (2) better performance than current +video chatbots without any video instruction tuning. (3) state-of-the-art +results of video chatting using video instruction tuning, outperforming +previous SOTAs by a large margin.",cs.CV,['cs.CV'] +Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection,Ting Lei · Shaofeng Yin · Yang Liu, ,https://arxiv.org/abs/2404.06194,,2404.06194.pdf,Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection,"Open-vocabulary human-object interaction (HOI) detection, which is concerned +with the problem of detecting novel HOIs guided by natural language, is crucial +for understanding human-centric scenes. However, prior zero-shot HOI detectors +often employ the same levels of feature maps to model HOIs with varying +distances, leading to suboptimal performance in scenes containing human-object +pairs with a wide range of distances. In addition, these detectors primarily +rely on category names and overlook the rich contextual information that +language can provide, which is essential for capturing open vocabulary concepts +that are typically rare and not well-represented by category names alone. In +this paper, we introduce a novel end-to-end open vocabulary HOI detection +framework with conditional multi-level decoding and fine-grained semantic +enhancement (CMD-SE), harnessing the potential of Visual-Language Models +(VLMs). Specifically, we propose to model human-object pairs with different +distances with different levels of feature maps by incorporating a soft +constraint during the bipartite matching process. Furthermore, by leveraging +large language models (LLMs) such as GPT models, we exploit their extensive +world knowledge to generate descriptions of human body part states for various +interactions. Then we integrate the generalizable and fine-grained semantics of +human body parts to improve interaction recognition. Experimental results on +two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method +achieves state-of-the-art results in open vocabulary HOI detection. The code +and models are available at https://github.com/ltttpku/CMD-SE-release.",cs.CV,['cs.CV'] +SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction,Zechuan Zhang · Zongxin Yang · Yi Yang,https://river-zhang.github.io/SIFU-projectpage/,https://arxiv.org/abs/2312.06704,,2312.06704.pdf,SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction,"Creating high-quality 3D models of clothed humans from single images for +real-world applications is crucial. 
Despite recent advancements, accurately +reconstructing humans in complex poses or with loose clothing from in-the-wild +images, along with predicting textures for unseen areas, remains a significant +challenge. A key limitation of previous methods is their insufficient prior +guidance in transitioning from 2D to 3D and in texture prediction. In response, +we introduce SIFU (Side-view Conditioned Implicit Function for Real-world +Usable Clothed Human Reconstruction), a novel approach combining a Side-view +Decoupling Transformer with a 3D Consistent Texture Refinement pipeline.SIFU +employs a cross-attention mechanism within the transformer, using SMPL-X +normals as queries to effectively decouple side-view features in the process of +mapping 2D features to 3D. This method not only improves the precision of the +3D models but also their robustness, especially when SMPL-X estimates are not +perfect. Our texture refinement process leverages text-to-image diffusion-based +prior to generate realistic and consistent textures for invisible views. +Through extensive experiments, SIFU surpasses SOTA methods in both geometry and +texture reconstruction, showcasing enhanced robustness in complex scenarios and +achieving an unprecedented Chamfer and P2S measurement. Our approach extends to +practical applications such as 3D printing and scene building, demonstrating +its broad utility in real-world scenarios. Project page +https://river-zhang.github.io/SIFU-projectpage/ .",cs.CV,['cs.CV'] +Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,Chaoqin Huang · Aofan Jiang · Jinghao Feng · Ya Zhang · Xinchao Wang · Yanfeng Wang, ,https://arxiv.org/abs/2403.12570,,2403.12570.pdf,Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,"Recent advancements in large-scale visual-language pre-trained models have +led to significant progress in zero-/few-shot anomaly detection within natural +image domains. However, the substantial domain divergence between natural and +medical images limits the effectiveness of these methodologies in medical +anomaly detection. This paper introduces a novel lightweight multi-level +adaptation and comparison framework to repurpose the CLIP model for medical +anomaly detection. Our approach integrates multiple residual adapters into the +pre-trained visual encoder, enabling a stepwise enhancement of visual features +across different levels. This multi-level adaptation is guided by multi-level, +pixel-wise visual-language feature alignment loss functions, which recalibrate +the model's focus from object semantics in natural imagery to anomaly +identification in medical images. The adapted features exhibit improved +generalization across various medical data types, even in zero-shot scenarios +where the model encounters unseen medical modalities and anatomical regions +during training. Our experiments on medical anomaly detection benchmarks +demonstrate that our method significantly surpasses current state-of-the-art +models, with an average AUC improvement of 6.24% and 7.33% for anomaly +classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot +and few-shot settings, respectively. 
Source code is available at: +https://github.com/MediaBrain-SJTU/MVFA-AD",cs.CV,['cs.CV'] +Panacea: Panoramic and Controllable Video Generation for Autonomous Driving,Yuqing Wen · Yucheng Zhao · Yingfei Liu · Fan Jia · Yanhui Wang · Chong Luo · Chi Zhang · Tiancai Wang · Xiaoyan Sun · Xiangyu Zhang, ,https://arxiv.org/abs/2311.16813,,2311.16813.pdf,Panacea: Panoramic and Controllable Video Generation for Autonomous Driving,"The field of autonomous driving increasingly demands high-quality annotated +training data. In this paper, we propose Panacea, an innovative approach to +generate panoramic and controllable videos in driving scenarios, capable of +yielding an unlimited numbers of diverse, annotated samples pivotal for +autonomous driving advancements. Panacea addresses two critical challenges: +'Consistency' and 'Controllability.' Consistency ensures temporal and +cross-view coherence, while Controllability ensures the alignment of generated +content with corresponding annotations. Our approach integrates a novel 4D +attention and a two-stage generation pipeline to maintain coherence, +supplemented by the ControlNet framework for meticulous control by the +Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative +evaluations of Panacea on the nuScenes dataset prove its effectiveness in +generating high-quality multi-view driving-scene videos. This work notably +propels the field of autonomous driving by effectively augmenting the training +dataset used for advanced BEV perception techniques.",cs.CV,['cs.CV'] +No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation,Xiangyang Zhu · Renrui Zhang · Bowei He · Ziyu Guo · Jiaming Liu · Han Xiao · Chaoyou Fu · Hao Dong · Peng Gao, ,https://arxiv.org/abs/2404.04050,,2404.04050.pdf,No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation,"To reduce the reliance on large-scale datasets, recent works in 3D +segmentation resort to few-shot learning. Current 3D few-shot segmentation +methods first pre-train models on 'seen' classes, and then evaluate their +generalization performance on 'unseen' classes. However, the prior pre-training +stage not only introduces excessive time overhead but also incurs a significant +domain gap on 'unseen' classes. To tackle these issues, we propose a +Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its Parametric +variant, Seg-PN. Without training, Seg-NN extracts dense representations by +hand-crafted filters and achieves comparable performance to existing parametric +models. Due to the elimination of pre-training, Seg-NN can alleviate the domain +gap issue and save a substantial amount of time. Based on Seg-NN, Seg-PN only +requires training a lightweight QUEry-Support Transferring (QUEST) module, +which enhances the interaction between the support set and query set. 
+Experiments suggest that Seg-PN outperforms previous state-of-the-art method by ++4.19% and +7.71% mIoU on S3DIS and ScanNet datasets respectively, while +reducing training time by -90%, indicating its effectiveness and efficiency.",cs.CV,['cs.CV'] +MoST: Motion Style Transformer between Diverse Action Contents,Boeun Kim · Jungho Kim · Hyung Jin Chang · Jin Young Choi,https://boeun-kim.github.io/page-MoST/,https://arxiv.org/abs/2403.06225,,2403.06225.pdf,MoST: Motion Style Transformer between Diverse Action Contents,"While existing motion style transfer methods are effective between two +motions with identical content, their performance significantly diminishes when +transferring style between motions with different contents. This challenge lies +in the lack of clear separation between content and style of a motion. To +tackle this challenge, we propose a novel motion style transformer that +effectively disentangles style from content and generates a plausible motion +with transferred style from a source motion. Our distinctive approach to +achieving the goal of disentanglement is twofold: (1) a new architecture for +motion style transformer with `part-attentive style modulator across body +parts' and `Siamese encoders that encode style and content features +separately'; (2) style disentanglement loss. Our method outperforms existing +methods and demonstrates exceptionally high quality, particularly in motion +pairs with different contents, without the need for heuristic post-processing. +Codes are available at https://github.com/Boeun-Kim/MoST.",cs.CV,"['cs.CV', 'cs.AI']" +Learned Lossless Image Compression based on Bit Plane Slicing,Zhe Zhang · Huairui Wang · Zhenzhong Chen · Shan Liu, ,https://arxiv.org/abs/2308.13287,,2308.13287.pdf,Efficient Learned Lossless JPEG Recompression,"JPEG is one of the most popular image compression methods. It is beneficial +to compress those existing JPEG files without introducing additional +distortion. In this paper, we propose a deep learning based method to further +compress JPEG images losslessly. Specifically, we propose a Multi-Level +Parallel Conditional Modeling (ML-PCM) architecture, which enables parallel +decoding in different granularities. First, luma and chroma are processed +independently to allow parallel coding. Second, we propose pipeline parallel +context model (PPCM) and compressed checkerboard context model (CCCM) for the +effective conditional modeling and efficient decoding within luma and chroma +components. Our method has much lower latency while achieves better compression +ratio compared with previous SOTA. After proper software optimization, we can +obtain a good throughput of 57 FPS for 1080P images on NVIDIA T4 GPU. +Furthermore, combined with quantization, our approach can also act as a lossy +JPEG codec which has obvious advantage over SOTA lossy compression methods in +high bit rate (bpp$>0.9$).",eess.IV,['eess.IV'] +Structure-Aware Sparse-View X-ray 3D Reconstruction,Yuanhao Cai · Jiahao Wang · Alan L. Yuille · Zongwei Zhou · Angtian Wang,https://github.com/caiyuanhao1998/SAX-NeRF,https://arxiv.org/abs/2311.10959,,2311.10959.pdf,Structure-Aware Sparse-View X-ray 3D Reconstruction,"X-ray, known for its ability to reveal internal structures of objects, is +expected to provide richer information for 3D reconstruction than visible +light. Yet, existing neural radiance fields (NeRF) algorithms overlook this +important nature of X-ray, leading to their limitations in capturing structural +contents of imaged objects. 
In this paper, we propose a framework, +Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF), for sparse-view +X-ray 3D reconstruction. Firstly, we design a Line Segment-based Transformer +(Lineformer) as the backbone of SAX-NeRF. Linefomer captures internal +structures of objects in 3D space by modeling the dependencies within each line +segment of an X-ray. Secondly, we present a Masked Local-Global (MLG) ray +sampling strategy to extract contextual and geometric information in 2D +projection. Plus, we collect a larger-scale dataset X3D covering wider X-ray +applications. Experiments on X3D show that SAX-NeRF surpasses previous +NeRF-based methods by 12.56 and 2.49 dB on novel view synthesis and CT +reconstruction. Code, models, and data are released at +https://github.com/caiyuanhao1998/SAX-NeRF",eess.IV,"['eess.IV', 'cs.CV']" +Point Cloud Pre-training with Diffusion Models,xiao zheng · Xiaoshui Huang · Guofeng Mei · Zhaoyang Lyu · Yuenan Hou · Wanli Ouyang · Bo Dai · Yongshun Gong, ,https://arxiv.org/abs/2311.14960,,2311.14960.pdf,Point Cloud Pre-training with Diffusion Models,"Pre-training a model and then fine-tuning it on downstream tasks has +demonstrated significant success in the 2D image and NLP domains. However, due +to the unordered and non-uniform density characteristics of point clouds, it is +non-trivial to explore the prior knowledge of point clouds and pre-train a +point cloud backbone. In this paper, we propose a novel pre-training method +called Point cloud Diffusion pre-training (PointDif). We consider the point +cloud pre-training task as a conditional point-to-point generation problem and +introduce a conditional point generator. This generator aggregates the features +extracted by the backbone and employs them as the condition to guide the +point-to-point recovery from the noisy point cloud, thereby assisting the +backbone in capturing both local and global geometric priors as well as the +global point density distribution of the object. We also present a recurrent +uniform sampling optimization strategy, which enables the model to uniformly +recover from various noise levels and learn from balanced supervision. Our +PointDif achieves substantial improvement across various real-world datasets +for diverse downstream tasks such as classification, segmentation and +detection. Specifically, PointDif attains 70.0% mIoU on S3DIS Area 5 for the +segmentation task and achieves an average improvement of 2.4% on ScanObjectNN +for the classification task compared to TAP. Furthermore, our pre-training +framework can be flexibly applied to diverse point cloud backbones and bring +considerable gains.",cs.CV,['cs.CV'] +DiffLoc: Diffusion Model for Outdoor LiDAR Localization,Wen Li · Yuyang Yang · Shangshu Yu · Guosheng Hu · Chenglu Wen · Ming Cheng · Cheng Wang, ,,https://www.youtube.com/watch?v=sSW9nHQR0nc,,,,,nan +Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation,Shenshen Bu · Taiji Li · Zhiming Dai · Yuedong Yang,https://github.com/hnjzbss/EKAGen,https://arxiv.org/abs/2311.00399,,2311.00399.pdf,Enhanced Knowledge Injection for Radiology Report Generation,"Automatic generation of radiology reports holds crucial clinical value, as it +can alleviate substantial workload on radiologists and remind less experienced +ones of potential anomalies. 
Despite the remarkable performance of various +image captioning methods in the natural image field, generating accurate +reports for medical images still faces challenges, i.e., disparities in visual +and textual data, and lack of accurate domain knowledge. To address these +issues, we propose an enhanced knowledge injection framework, which utilizes +two branches to extract different types of knowledge. The Weighted Concept +Knowledge (WCK) branch is responsible for introducing clinical medical concepts +weighted by TF-IDF scores. The Multimodal Retrieval Knowledge (MRK) branch +extracts triplets from similar reports, emphasizing crucial clinical +information related to entity positions and existence. By integrating this +finer-grained and well-structured knowledge with the current image, we are able +to leverage the multi-source knowledge gain to ultimately facilitate more +accurate report generation. Extensive experiments have been conducted on two +public benchmarks, demonstrating that our method achieves superior performance +over other state-of-the-art methods. Ablation studies further validate the +effectiveness of two extracted knowledge sources.",cs.CV,"['cs.CV', 'cs.CL']" +Enhancing Multimodal Cooperation via Sample-level Modality Valuation,Yake Wei · Ruoxuan Feng · Zihe Wang · Di Hu, ,https://arxiv.org/html/2309.06255v3,,2309.06255v3.pdf,Enhancing Multimodal Cooperation via Fine-grained Modality Valuation,"One primary topic of multimodal learning is to jointly incorporate +heterogeneous information from different modalities. However, most models often +suffer from unsatisfactory multimodal cooperation, which cannot jointly utilize +all modalities well. Some methods are proposed to identify and enhance the +worse learnt modality, but they are often hard to provide the fine-grained +observation of multimodal cooperation at sample-level with theoretical support. +Hence, it is essential to reasonably observe and improve the fine-grained +cooperation between modalities, especially when facing realistic scenarios +where the modality discrepancy could vary across different samples. To this +end, we introduce a sample-level modality valuation metric to evaluate the +contribution of each modality for each sample. Via modality valuation, we +observe that modality discrepancy indeed could be different at sample-level, +beyond the global contribution discrepancy at dataset-level. We further analyze +this issue and improve cooperation between modalities at sample-level by +enhancing the discriminative ability of low-contributing modalities in a +targeted manner. Overall, our methods reasonably observe the fine-grained +uni-modal contribution and achieve considerable improvement. The source code +and dataset are available at +\url{https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification,Zexian Yang · Dayan Wu · Chenming Wu · Zheng Lin · JingziGU · Weiping Wang, ,,https://ieeexplore.ieee.org/document/10301577,,,,,nan +Data-Efficient Multimodal Fusion on a Single GPU,Noël Vouitsis · Zhaoyan Liu · Satya Krishna Gorti · Valentin Villecroze · Jesse C. Cresswell · Guangwei Yu · Gabriel Loaiza-Ganem · Maksims Volkovs, ,https://arxiv.org/abs/2312.10144,,2312.10144.pdf,Data-Efficient Multimodal Fusion on a Single GPU,"The goal of multimodal alignment is to learn a single latent space that is +shared between multimodal inputs. 
The most powerful models in this space have +been trained using massive datasets of paired inputs and large-scale +computational resources, making them prohibitively expensive to train in many +practical scenarios. We surmise that existing unimodal encoders pre-trained on +large amounts of unimodal data should provide an effective bootstrap to create +multimodal models from unimodal ones at much lower costs. We therefore propose +FuseMix, a multimodal augmentation scheme that operates on the latent spaces of +arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal +alignment, we achieve competitive performance -- and in certain cases +outperform state-of-the art methods -- in both image-text and audio-text +retrieval, with orders of magnitude less compute and data: for example, we +outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! +600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. +Additionally, we show how our method can be applied to convert pre-trained +text-to-image generative models into audio-to-image ones. Code is available at: +https://github.com/layer6ai-labs/fusemix.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +SimDA: Simple Diffusion Adapter for Efficient Video Generation,Zhen Xing · Qi Dai · Han Hu · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2308.09710,,2308.09710.pdf,SimDA: Simple Diffusion Adapter for Efficient Video Generation,"The recent wave of AI-generated content has witnessed the great development +and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video +(T2V) still falls short of expectations though attracting increasing interests. +Existing works either train from scratch or adapt large T2I model to videos, +both of which are computation and resource expensive. In this work, we propose +a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B +parameters of a strong T2I model, adapting it to video generation in a +parameter-efficient way. In particular, we turn the T2I model for T2V by +designing light-weight spatial and temporal adapters for transfer learning. +Besides, we change the original spatial attention to the proposed Latent-Shift +Attention (LSA) for temporal consistency. With similar model architecture, we +further train a video super-resolution model to generate high-definition +(1024x1024) videos. In addition to T2V generation in the wild, SimDA could also +be utilized in one-shot video editing with only 2 minutes tuning. Doing so, our +method could minimize the training effort with extremely few tunable parameters +for model adaptation.",cs.CV,"['cs.CV', 'cs.AI']" +Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation,Tianshui Chen · Jianman Lin · Zhijing Yang · Chunmei Qing · Liang Lin, ,https://arxiv.org/abs/2401.11085,,2401.11085.pdf,Adaptive Global-Local Representation Learning and Selection for Cross-Domain Facial Expression Recognition,"Domain shift poses a significant challenge in Cross-Domain Facial Expression +Recognition (CD-FER) due to the distribution variation across different +domains. Current works mainly focus on learning domain-invariant features +through global feature adaptation, while neglecting the transferability of +local features. Additionally, these methods lack discriminative supervision +during training on target datasets, resulting in deteriorated feature +representation in target domain. 
To address these limitations, we propose an +Adaptive Global-Local Representation Learning and Selection (AGLRLS) framework. +The framework incorporates global-local adversarial adaptation and +semantic-aware pseudo label generation to enhance the learning of +domain-invariant and discriminative feature during training. Meanwhile, a +global-local prediction consistency learning is introduced to improve +classification results during inference. Specifically, the framework consists +of separate global-local adversarial learning modules that learn +domain-invariant global and local features independently. We also design a +semantic-aware pseudo label generation module, which computes semantic labels +based on global and local features. Moreover, a novel dynamic threshold +strategy is employed to learn the optimal thresholds by leveraging independent +prediction of global and local features, ensuring filtering out the unreliable +pseudo labels while retaining reliable ones. These labels are utilized for +model optimization through the adversarial learning process in an end-to-end +manner. During inference, a global-local prediction consistency module is +developed to automatically learn an optimal result from multiple predictions. +We conduct comprehensive experiments and analysis based on a fair evaluation +benchmark. The results demonstrate that the proposed framework outperforms the +current competing methods by a substantial margin.",cs.CV,"['cs.CV', 'cs.AI']" +Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow,Hanyu Zhou · Yi Chang · Zhiwei Shi,https://hyzhouboy.github.io/,https://arxiv.org/abs/2403.07432,,2403.07432.pdf,Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow,"Single RGB or LiDAR is the mainstream sensor for the challenging scene flow, +which relies heavily on visual features to match motion features. Compared with +single modality, existing methods adopt a fusion strategy to directly fuse the +cross-modal complementary knowledge in motion space. However, these direct +fusion methods may suffer the modality gap due to the visual intrinsic +heterogeneous nature between RGB and LiDAR, thus deteriorating motion features. +We discover that event has the homogeneous nature with RGB and LiDAR in both +visual and motion spaces. In this work, we bring the event as a bridge between +RGB and LiDAR, and propose a novel hierarchical visual-motion fusion framework +for scene flow, which explores a homogeneous space to fuse the cross-modal +complementary knowledge for physical interpretation. In visual fusion, we +discover that event has a complementarity (relative v.s. absolute) in luminance +space with RGB for high dynamic imaging, and has a complementarity (local +boundary v.s. global shape) in scene structure space with LiDAR for structure +integrity. In motion fusion, we figure out that RGB, event and LiDAR are +complementary (spatial-dense, temporal-dense v.s. spatiotemporal-sparse) to +each other in correlation space, which motivates us to fuse their motion +correlations for motion continuity. The proposed hierarchical fusion can +explicitly fuse the multimodal knowledge to progressively improve scene flow +from visual space to motion space. 
Extensive experiments have been performed to +verify the superiority of the proposed method.",cs.CV,['cs.CV'] +Efficient Vision-Language Pre-training by Cluster Masking,Zihao Wei · Zixuan Pan · Andrew Owens, ,https://arxiv.org/abs/2405.08815,,2405.08815.pdf,Efficient Vision-Language Pre-training by Cluster Masking,"We propose a simple strategy for masking image patches during visual-language +contrastive learning that improves the quality of the learned representations +and the training speed. During each iteration of training, we randomly mask +clusters of visually similar image patches, as measured by their raw pixel +intensities. This provides an extra learning signal, beyond the contrastive +training itself, since it forces a model to predict words for masked visual +structures solely from context. It also speeds up training by reducing the +amount of data used in each image. We evaluate the effectiveness of our model +by pre-training on a number of benchmarks, finding that it outperforms other +masking strategies, such as FLIP, on the quality of the learned representation.",cs.CV,['cs.CV'] +Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction,Cheng Sun · Wei-En Tai · Yu-Lin Shih · Kuan-Wei Chen · Yong-Jing Syu · Kent Selwyn The · Yu-Chiang Frank Wang · Hwann-Tzong Chen, ,https://arxiv.org/abs/2311.18695v1,,2311.18695v1.pdf,Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction,"State-of-the-art single-view 360-degree room layout reconstruction methods +formulate the problem as a high-level 1D (per-column) regression task. On the +other hand, traditional low-level 2D layout segmentation is simpler to learn +and can represent occluded regions, but it requires complex post-processing for +the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to +render 1D layout depth regression from the 2D segmentation map in a +differentiable and occlusion-aware way, marrying the merits of both sides. +Specifically, our model predicts floor-plan density for the input +equirectangular 360-degree image. Formulating the 2D layout representation as a +density field enables us to employ `flattened' volume rendering to form 1D +layout depth regression. In addition, we propose a novel 3D warping +augmentation on layout to improve generalization. Finally, we re-implement +recent room layout reconstruction methods into our codebase for benchmarking +and explore modern backbones and training techniques to serve as the strong +baseline. Our model significantly outperforms previous arts. The code will be +made available upon publication.",cs.CV,"['cs.CV', 'cs.LG']" +Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching,Shreyas Fadnavis · Agniva Chowdhury · Joshua Batson · Petros Drineas · Eleftherios Garyfallidis, ,,,,,,,nan +X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition,Shuofeng Sun · Yongming Rao · Jiwen Lu · Haibin Yan,https://github.com/sunshuofeng/X-3D,https://arxiv.org/abs/2404.15010,,2404.15010.pdf,X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition,"Numerous prior studies predominantly emphasize constructing relation vectors +for individual neighborhood points and generating dynamic kernels for each +vector and embedding these into high-dimensional spaces to capture implicit +local structures. 
However, we contend that such implicit high-dimensional +structure modeling approch inadequately represents the local geometric +structure of point clouds due to the absence of explicit structural +information. Hence, we introduce X-3D, an explicit 3D structure modeling +approach. X-3D functions by capturing the explicit local structural information +within the input 3D space and employing it to produce dynamic kernels with +shared weights for all neighborhood points within the current local region. +This modeling approach introduces effective geometric prior and significantly +diminishes the disparity between the local structure of the embedding space and +the original input point cloud, thereby improving the extraction of local +features. Experiments show that our method can be used on a variety of methods +and achieves state-of-the-art performance on segmentation, classification, +detection tasks with lower extra computational cost, such as \textbf{90.7\%} on +ScanObjectNN for classification, \textbf{79.2\%} on S3DIS 6 fold and +\textbf{74.3\%} on S3DIS Area 5 for segmentation, \textbf{76.3\%} on ScanNetV2 +for segmentation and \textbf{64.5\%} mAP , \textbf{46.9\%} mAP on SUN RGB-D and +\textbf{69.0\%} mAP , \textbf{51.1\%} mAP on ScanNetV2 . Our code is available +at +\href{https://github.com/sunshuofeng/X-3D}{https://github.com/sunshuofeng/X-3D}.",cs.CV,['cs.CV'] +NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation,Sicheng Li · Hao Li · Yiyi Liao · Lu Yu,https://jasonlsc.github.io/nerfcodec_homepage/,https://arxiv.org/abs/2404.02185,,2404.02185.pdf,NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation,"The emergence of Neural Radiance Fields (NeRF) has greatly impacted 3D scene +modeling and novel-view synthesis. As a kind of visual media for 3D scene +representation, compression with high rate-distortion performance is an eternal +target. Motivated by advances in neural compression and neural field +representation, we propose NeRFCodec, an end-to-end NeRF compression framework +that integrates non-linear transform, quantization, and entropy coding for +memory-efficient scene representation. Since training a non-linear transform +directly on a large scale of NeRF feature planes is impractical, we discover +that pre-trained neural 2D image codec can be utilized for compressing the +features when adding content-specific parameters. Specifically, we reuse neural +2D image codec but modify its encoder and decoder heads, while keeping the +other parts of the pre-trained decoder frozen. This allows us to train the full +pipeline via supervision of rendering loss and entropy loss, yielding the +rate-distortion balance by updating the content-specific parameters. At test +time, the bitstreams containing latent code, feature decoder head, and other +side information are transmitted for communication. Experimental results +demonstrate our method outperforms existing NeRF compression methods, enabling +high-quality novel view synthesis with a memory budget of 0.5 MB.",cs.CV,"['cs.CV', 'cs.GR', 'eess.IV']" +Desigen: A Pipeline for Controllable Design Template Generation,Haohan Weng · Danqing Huang · YU QIAO · Hu Zheng · Chin-Yew Lin · Tong Zhang · C. L. 
Philip Chen, ,https://arxiv.org/html/2403.09093v1,,2403.09093v1.pdf,Desigen: A Pipeline for Controllable Design Template Generation,"Templates serve as a good starting point to implement a design (e.g., banner, +slide) but it takes great effort from designers to manually create. In this +paper, we present Desigen, an automatic template creation pipeline which +generates background images as well as harmonious layout elements over the +background. Different from natural images, a background image should preserve +enough non-salient space for the overlaying layout elements. To equip existing +advanced diffusion-based models with stronger spatial control, we propose two +simple but effective techniques to constrain the saliency distribution and +reduce the attention weight in desired regions during the background generation +process. Then conditioned on the background, we synthesize the layout with a +Transformer-based autoregressive generator. To achieve a more harmonious +composition, we propose an iterative inference strategy to adjust the +synthesized background and layout in multiple rounds. We constructed a design +dataset with more than 40k advertisement banners to verify our approach. +Extensive experiments demonstrate that the proposed pipeline generates +high-quality templates comparable to human designers. More than a single-page +design, we further show an application of presentation generation that outputs +a set of theme-consistent slides. The data and code are available at +https://whaohan.github.io/desigen.",cs.CV,['cs.CV'] +"Sparse views, Near light: A practical paradigm for uncalibrated point-light photometric stereo",Mohammed Brahimi · Bjoern Haefner · Zhenzhang Ye · Bastian Goldluecke · Daniel Cremers, ,https://arxiv.org/abs/2404.00098,,2404.00098.pdf,"Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo","Neural approaches have shown a significant progress on camera-based +reconstruction. But they require either a fairly dense sampling of the viewing +sphere, or pre-training on an existing dataset, thereby limiting their +generalizability. In contrast, photometric stereo (PS) approaches have shown +great potential for achieving high-quality reconstruction under sparse +viewpoints. Yet, they are impractical because they typically require tedious +laboratory conditions, are restricted to dark rooms, and often multi-staged, +making them subject to accumulated errors. To address these shortcomings, we +propose an end-to-end uncalibrated multi-view PS framework for reconstructing +high-resolution shapes acquired from sparse viewpoints in a real-world +environment. We relax the dark room assumption, and allow a combination of +static ambient lighting and dynamic near LED lighting, thereby enabling easy +data capture outside the lab. Experimental validation confirms that it +outperforms existing baseline approaches in the regime of sparse viewpoints by +a large margin. 
This allows to bring high-accuracy 3D reconstruction from the +dark room to the real world, while maintaining a reasonable data capture +complexity.",cs.CV,['cs.CV'] +LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example,Soyeon Yoon · Kwan Yun · Kwanggyoon Seo · Sihun Cha · Jung Eun Yoo · Junyong Noh,https://kwanyun.github.io/lego/,https://arxiv.org/abs/2403.15227,,2403.15227.pdf,LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example,"Recent advances in 3D face stylization have made significant strides in few +to zero-shot settings. However, the degree of stylization achieved by existing +methods is often not sufficient for practical applications because they are +mostly based on statistical 3D Morphable Models (3DMM) with limited variations. +To this end, we propose a method that can produce a highly stylized 3D face +model with desired topology. Our methods train a surface deformation network +with 3DMM and translate its domain to the target style using a paired exemplar. +The network achieves stylization of the 3D face mesh by mimicking the style of +the target using a differentiable renderer and directional CLIP losses. +Additionally, during the inference process, we utilize a Mesh Agnostic Encoder +(MAGE) that takes deformation target, a mesh of diverse topologies as input to +the stylization process and encodes its shape into our latent space. The +resulting stylized face model can be animated by commonly used 3DMM blend +shapes. A set of quantitative and qualitative evaluations demonstrate that our +method can produce highly stylized face meshes according to a given style and +output them in a desired topology. We also demonstrate example applications of +our method including image-based stylized avatar generation, linear +interpolation of geometric styles, and facial animation of stylized avatars.",cs.CV,"['cs.CV', 'cs.GR', '68T45', 'I.4.9']" +Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models,Chang Liu · Haoning Wu · Yujie Zhong · Xiaoyun Zhang · Yanfeng Wang · Weidi Xie,https://haoningwu3639.github.io/StoryGen_Webpage/,https://ar5iv.labs.arxiv.org/html/2312.03884,,2312.03884.pdf,WonderJourney: Going from Anywhere to Everywhere,"We introduce WonderJourney, a modularized framework for perpetual 3D scene +generation. Unlike prior work on view generation that focuses on a single type +of scenes, we start at any user-provided location (by a text description or an +image) and generate a journey through a long sequence of diverse yet coherently +connected 3D scenes. We leverage an LLM to generate textual descriptions of the +scenes in this journey, a text-driven point cloud generation pipeline to make a +compelling and coherent sequence of 3D scenes, and a large VLM to verify the +generated scenes. We show compelling, diverse visual results across various +scene types and styles, forming imaginary ""wonderjourneys"". Project website: +https://kovenyu.com/WonderJourney/",cs.CV,"['cs.CV', 'cs.GR']" +CSTA: CNN-based Spatiotemporal Attention for Video Summarization,Jaewon Son · Jaehun Park · Kwangsu Kim,https://github.com/thswodnjs3/CSTA,https://arxiv.org/abs/2405.11905,,2405.11905.pdf,CSTA: CNN-based Spatiotemporal Attention for Video Summarization,"Video summarization aims to generate a concise representation of a video, +capturing its essential content and key moments while reducing its overall +length. 
Although several methods employ attention mechanisms to handle +long-term dependencies, they often fail to capture the visual significance +inherent in frames. To address this limitation, we propose a CNN-based +SpatioTemporal Attention (CSTA) method that stacks each feature of frames from +a single video to form image-like frame representations and applies 2D CNN to +these frame features. Our methodology relies on CNN to comprehend the inter and +intra-frame relations and to find crucial attributes in videos by exploiting +its ability to learn absolute positions within images. In contrast to previous +work compromising efficiency by designing additional modules to focus on +spatial importance, CSTA requires minimal computational overhead as it uses CNN +as a sliding window. Extensive experiments on two benchmark datasets (SumMe and +TVSum) demonstrate that our proposed approach achieves state-of-the-art +performance with fewer MACs compared to previous methods. Codes are available +at https://github.com/thswodnjs3/CSTA.",cs.CV,['cs.CV'] +LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP,Yunshi HUANG · Fereshteh Shakeri · Jose Dolz · Malik Boudiaf · Houda Bahig · Ismail Ben Ayed, ,https://arxiv.org/abs/2404.02285,,2404.02285.pdf,LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP,"In a recent, strongly emergent literature on few-shot CLIP adaptation, Linear +Probe (LP) has been often reported as a weak baseline. This has motivated +intensive research building convoluted prompt learning or feature adaptation +strategies. In this work, we propose and examine from convex-optimization +perspectives a generalization of the standard LP baseline, in which the linear +classifier weights are learnable functions of the text embedding, with +class-wise multipliers blending image and text knowledge. As our objective +function depends on two types of variables, i.e., the class visual prototypes +and the learnable blending parameters, we propose a computationally efficient +block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM +optimizer, which we coin LP++, step sizes are implicit, unlike standard +gradient descent practices where learning rates are intensively searched over +validation sets. By examining the mathematical properties of our loss (e.g., +Lipschitz gradient continuity), we build majorizing functions yielding +data-driven learning rates and derive approximations of the loss's minima, +which provide data-informed initialization of the variables. Our image-language +objective function, along with these non-trivial optimization insights and +ingredients, yields, surprisingly, highly competitive few-shot CLIP +performances. Furthermore, LP++ operates in black-box, relaxes intensive +validation searches for the optimization hyper-parameters, and runs +orders-of-magnitudes faster than state-of-the-art few-shot CLIP adaptation +methods. Our code is available at: +\url{https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git}.",cs.CV,['cs.CV'] +EarthLoc: Astronaut Photography Localization by Indexing Earth from Space,Gabriele Berton · Alex Stoken · Barbara Caputo · Carlo Masone,https://github.com/gmberton/EarthLoc,https://arxiv.org/abs/2403.06758,,2403.06758.pdf,EarthLoc: Astronaut Photography Localization by Indexing Earth from Space,"Astronaut photography, spanning six decades of human spaceflight, presents a +unique Earth observations dataset with immense value for both scientific +research and disaster response. 
Despite its significance, accurately localizing +the geographical extent of these images, crucial for effective utilization, +poses substantial challenges. Current manual localization efforts are +time-consuming, motivating the need for automated solutions. We propose a novel +approach - leveraging image retrieval - to address this challenge efficiently. +We introduce innovative training techniques, including Year-Wise Data +Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the +development of a high-performance model, EarthLoc. We develop six evaluation +datasets and perform a comprehensive benchmark comparing EarthLoc to existing +methods, showcasing its superior efficiency and accuracy. Our approach marks a +significant advancement in automating the localization of astronaut +photography, which will help bridge a critical gap in Earth observations data. +Code and datasets are available at https://github.com/gmberton/EarthLoc",cs.CV,['cs.CV'] +Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis,Jiawen Li · Yuxuan Chen · Hongbo Chu · Sun Qiehe · Tian Guan · Anjia Han · Yonghong He, ,https://arxiv.org/abs/2403.07719,,2403.07719.pdf,Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis,"Histopathological whole slide images (WSIs) classification has become a +foundation task in medical microscopic imaging processing. Prevailing +approaches involve learning WSIs as instance-bag representations, emphasizing +significant instances but struggling to capture the interactions between +instances. Additionally, conventional graph representation methods utilize +explicit spatial positions to construct topological structures but restrict the +flexible interaction capabilities between instances at arbitrary locations, +particularly when spatially distant. In response, we propose a novel dynamic +graph representation algorithm that conceptualizes WSIs as a form of the +knowledge graph structure. Specifically, we dynamically construct neighbors and +directed edge embeddings based on the head and tail relationships between +instances. Then, we devise a knowledge-aware attention mechanism that can +update the head node features by learning the joint attention score of each +neighbor and edge. Finally, we obtain a graph-level embedding through the +global pooling process of the updated head, serving as an implicit +representation for the WSI classification. Our end-to-end graph representation +learning approach has outperformed the state-of-the-art WSI analysis methods on +three TCGA benchmark datasets and in-house test sets. Our code is available at +https://github.com/WonderLandxD/WiKG.",cs.CV,['cs.CV'] +Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary,Leheng Zhang · Yawei Li · Xingyu Zhou · Xiaorui Zhao · Shuhang Gu, ,https://arxiv.org/abs/2401.08209,,2401.08209.pdf,Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary,"Single Image Super-Resolution is a classic computer vision problem that +involves estimating high-resolution (HR) images from low-resolution (LR) ones. +Although deep neural networks (DNNs), especially Transformers for +super-resolution, have seen significant advancements in recent years, +challenges still remain, particularly in limited receptive field caused by +window-based self-attention. 
To address these issues, we introduce a group of +auxiliary Adaptive Token Dictionary to SR Transformer and establish an ATD-SR +method. The introduced token dictionary could learn prior information from +training data and adapt the learned prior to specific testing image through an +adaptive refinement step. The refinement strategy could not only provide global +information to all input tokens but also group image tokens into categories. +Based on category partitions, we further propose a category-based +self-attention mechanism designed to leverage distant but similar tokens for +enhancing input features. The experimental results show that our method +achieves the best performance on various single image super-resolution +benchmarks.",cs.CV,['cs.CV'] +Learning to Select Views for Efficient Multi-View Understanding,Yunzhong Hou · Stephen Gould · Liang Zheng, ,,https://openreview.net/forum?id=mzWQ2hOKNX,,,,,nan +MedBN: Robust Test-Time Adaptation against Malicious Test Samples,Hyejin Park · Jeongyeon Hwang · Sunung Mun · Sangdon Park · Jungseul Ok,http://hyejin-s.github.io/medbn,https://arxiv.org/abs/2403.19326,,2403.19326.pdf,MedBN: Robust Test-Time Adaptation against Malicious Test Samples,"Test-time adaptation (TTA) has emerged as a promising solution to address +performance decay due to unforeseen distribution shifts between training and +test data. While recent TTA methods excel in adapting to test data variations, +such adaptability exposes a model to vulnerability against malicious examples, +an aspect that has received limited attention. Previous studies have uncovered +security vulnerabilities within TTA even when a small proportion of the test +batch is maliciously manipulated. In response to the emerging threat, we +propose median batch normalization (MedBN), leveraging the robustness of the +median for statistics estimation within the batch normalization layer during +test-time inference. Our method is algorithm-agnostic, thus allowing seamless +integration with existing TTA frameworks. Our experimental results on benchmark +datasets, including CIFAR10-C, CIFAR100-C and ImageNet-C, consistently +demonstrate that MedBN outperforms existing approaches in maintaining robust +performance across different attack scenarios, encompassing both instant and +cumulative attacks. Through extensive experiments, we show that our approach +sustains the performance even in the absence of attacks, achieving a practical +balance between robustness and performance.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis,Tianci Bi · Xiaoyi Zhang · Zhizheng Zhang · Wenxuan Xie · Cuiling Lan · Yan Lu · Nanning Zheng, ,https://arxiv.org/abs/2405.07481,,2405.07481.pdf,Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis,"Significant progress has been made in scene text detection models since the +rise of deep learning, but scene text layout analysis, which aims to group +detected text instances as paragraphs, has not kept pace. Previous works either +treated text detection and grouping using separate models, or train a model +from scratch while using a unified one. All of them have not yet made full use +of the already well-trained text detectors and easily obtainable detection +datasets. 
In this paper, we present Text Grouping Adapter (TGA), a module that +can enable the utilization of various pre-trained text detectors to learn +layout analysis, allowing us to adopt a well-trained text detector right off +the shelf or just fine-tune it efficiently. Designed to be compatible with +various text detector architectures, TGA takes detected text regions and image +features as universal inputs to assemble text instance features. To capture +broader contextual information for layout analysis, we propose to predict text +group masks from text instance features by one-to-many assignment. Our +comprehensive experiments demonstrate that, even with frozen pre-trained +models, incorporating our TGA into various pre-trained text detectors and text +spotters can achieve superior layout analysis performance, simultaneously +inheriting generalized text detection ability from pre-training. In the case of +full parameter fine-tuning, we can further improve layout analysis performance.",cs.CV,['cs.CV'] +Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement,Jinyoung Jun · Jae-Han Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.19294,,2404.19294.pdf,Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement,"The main function of depth completion is to compensate for an insufficient +and unpredictable number of sparse depth measurements of hardware sensors. +However, existing research on depth completion assumes that the sparsity -- the +number of points or LiDAR lines -- is fixed for training and testing. Hence, +the completion performance drops severely when the number of sparse depths +changes significantly. To address this issue, we propose the sparsity-adaptive +depth refinement (SDR) framework, which refines monocular depth estimates using +sparse depth points. For SDR, we propose the masked spatial propagation network +(MSPN) to perform SDR with a varying number of sparse depths effectively by +gradually propagating sparse depth information throughout the entire depth map. +Experimental results demonstrate that MPSN achieves state-of-the-art +performance on both SDR and conventional depth completion scenarios.",cs.CV,['cs.CV'] +Depth-aware Test-Time Training for Zero-shot Video Object Segmentation,Weihuang Liu · Xi Shen · Haolun Li · Xiuli Bi · Bo Liu · Chi-Man Pun · Xiaodong Cun,https://nifangbaage.github.io/DATTT/,https://arxiv.org/abs/2403.04258,,2403.04258.pdf,Depth-aware Test-Time Training for Zero-shot Video Object Segmentation,"Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary +moving object without any human annotations. Mainstream solutions mainly focus +on learning a single model on large-scale video datasets, which struggle to +generalize to unseen videos. In this work, we introduce a test-time training +(TTT) strategy to address the problem. Our key insight is to enforce the model +to predict consistent depth during the TTT process. In detail, we first train a +single network to perform both segmentation and depth prediction tasks. This +can be effectively learned with our specifically designed depth modulation +layer. Then, for the TTT process, the model is updated by predicting consistent +depth maps for the same frame under different data augmentations. In addition, +we explore different TTT weight updating strategies. Our empirical results +suggest that the momentum-based weight initialization and looping-based +training scheme lead to more stable improvements. 
Experiments show that the +proposed method achieves clear improvements on ZSVOS. Our proposed video TTT +strategy provides significant superiority over state-of-the-art TTT methods. +Our code is available at: https://nifangbaage.github.io/DATTT.",cs.CV,['cs.CV'] +Viewpoint-Aware Visual Grounding in 3D Scenes,Xiangxi Shi · Zhonghua Wu · Stefan Lee, ,https://arxiv.org/abs/2403.03077,,2403.03077.pdf,MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding,"3D visual grounding involves matching natural language descriptions with +their corresponding objects in 3D spaces. Existing methods often face +challenges with accuracy in object recognition and struggle in interpreting +complex linguistic queries, particularly with descriptions that involve +multiple anchors or are view-dependent. In response, we present the MiKASA +(Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model +integrates a self-attention-based scene-aware object encoder and an original +multi-key-anchor technique, enhancing object recognition accuracy and the +understanding of spatial relationships. Furthermore, MiKASA improves the +explainability of decision-making, facilitating error diagnosis. Our model +achieves the highest overall accuracy in the Referit3D challenge for both the +Sr3D and Nr3D datasets, particularly excelling by a large margin in categories +that require viewpoint-dependent descriptions.",cs.CV,['cs.CV'] +Training Vision Transformers for Semi-Supervised Semantic Segmentation,Xinting Hu · Li Jiang · Bernt Schiele, ,,https://github.com/JoyHuYY1412/S4Former,,,,,nan +DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking,Cheng Huang · Shoudong Han · Mengyu He · Wenbo Zheng · Yuhao Wei, ,https://arxiv.org/abs/2403.02767,,2403.02767.pdf,DeconfuseTrack:Dealing with Confusion for Multi-Object Tracking,"Accurate data association is crucial in reducing confusion, such as ID +switches and assignment errors, in multi-object tracking (MOT). However, +existing advanced methods often overlook the diversity among trajectories and +the ambiguity and conflicts present in motion and appearance cues, leading to +confusion among detections, trajectories, and associations when performing +simple global data association. To address this issue, we propose a simple, +versatile, and highly interpretable data association approach called Decomposed +Data Association (DDA). DDA decomposes the traditional association problem into +multiple sub-problems using a series of non-learning-based modules and +selectively addresses the confusion in each sub-problem by incorporating +targeted exploitation of new cues. Additionally, we introduce Occlusion-aware +Non-Maximum Suppression (ONMS) to retain more occluded detections, thereby +increasing opportunities for association with trajectories and indirectly +reducing the confusion caused by missed detections. Finally, based on DDA and +ONMS, we design a powerful multi-object tracker named DeconfuseTrack, +specifically focused on resolving confusion in MOT. Extensive experiments +conducted on the MOT17 and MOT20 datasets demonstrate that our proposed DDA and +ONMS significantly enhance the performance of several popular trackers. +Moreover, DeconfuseTrack achieves state-of-the-art performance on the MOT17 and +MOT20 test sets, significantly outperforms the baseline tracker ByteTrack in +metrics such as HOTA, IDF1, AssA. 
This validates that our tracking design +effectively reduces confusion caused by simple global association.",cs.CV,['cs.CV'] +SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis,Ziqiao Peng · Wentao Hu · Yue Shi · Xiangyu Zhu · Xiaomei Zhang · Hao Zhao · Jun He · Hongyan Liu · Zhaoxin Fan,https://ziqiaopeng.github.io/synctalk/,https://arxiv.org/html/2311.17590v2,,2311.17590v2.pdf,SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis,"Achieving high synchronization in the synthesis of realistic, speech-driven +talking head videos presents a significant challenge. Traditional Generative +Adversarial Networks (GAN) struggle to maintain consistent facial identity, +while Neural Radiance Fields (NeRF) methods, although they can address this +issue, often produce mismatched lip movements, inadequate facial expressions, +and unstable head poses. A lifelike talking head requires synchronized +coordination of subject identity, lip movements, facial expressions, and head +poses. The absence of these synchronizations is a fundamental flaw, leading to +unrealistic and artificial outcomes. To address the critical issue of +synchronization, identified as the ""devil"" in creating realistic talking heads, +we introduce SyncTalk. This NeRF-based method effectively maintains subject +identity, enhancing synchronization and realism in talking head synthesis. +SyncTalk employs a Face-Sync Controller to align lip movements with speech and +innovatively uses a 3D facial blendshape model to capture accurate facial +expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more +natural head movements. The Portrait-Sync Generator restores hair details and +blends the generated head with the torso for a seamless visual experience. +Extensive experiments and user studies demonstrate that SyncTalk outperforms +state-of-the-art methods in synchronization and realism. We recommend watching +the supplementary video: https://ziqiaopeng.github.io/synctalk",cs.CV,['cs.CV'] +Efficient Test-Time Adaptation of Vision-Language Models,Adilbek Karmanov · Dayan Guan · Shijian Lu · Abdulmotaleb El Saddik · Eric P. Xing,https://kdiaaa.github.io/tda/,https://arxiv.org/abs/2403.18293,,2403.18293.pdf,Efficient Test-Time Adaptation of Vision-Language Models,"Test-time adaptation with pre-trained vision-language models has attracted +increasing attention for tackling distribution shifts during the test time. +Though prior studies have achieved very promising performance, they involve +intensive computation which is severely unaligned with test-time adaptation. We +design TDA, a training-free dynamic adapter that enables effective and +efficient test-time adaptation with vision-language models. TDA works with a +lightweight key-value cache that maintains a dynamic queue with few-shot pseudo +labels as values and the corresponding test-sample features as keys. Leveraging +the key-value cache, TDA allows adapting to test data gradually via progressive +pseudo label refinement which is super-efficient without incurring any +backpropagation. In addition, we introduce negative pseudo labeling that +alleviates the adverse impact of pseudo label noises by assigning pseudo labels +to certain negative classes when the model is uncertain about its pseudo label +predictions. Extensive experiments over two benchmarks demonstrate TDA's +superior effectiveness and efficiency as compared with the state-of-the-art. 
+The code has been released in \url{https://kdiaaa.github.io/tda/}.",cs.CV,['cs.CV'] +Systematic comparison of semi-supervised and self-supervised learning for medical image classification,Zhe Huang · Ruijie Jiang · Shuchin Aeron · Michael C. Hughes, ,https://arxiv.org/abs/2307.08919v2,,2307.08919v2.pdf,Systematic comparison of semi-supervised and self-supervised learning for medical image classification,"In many medical image classification problems, labeled data is scarce while +unlabeled data is more available. Semi-supervised learning and self-supervised +learning are two different research directions that can improve accuracy by +learning from extra unlabeled data. Recent methods from both directions have +reported significant gains on traditional benchmarks. Yet past benchmarks do +not focus on medical tasks and rarely compare self- and semi- methods together +on equal footing. Furthermore, past benchmarks often handle hyperparameter +tuning suboptimally. First, they may not tune hyperparameters at all, leading +to underfitting. Second, when tuning does occur, it often unrealistically uses +a labeled validation set much larger than the train set. Both cases make +previously published rankings of methods difficult to translate to practical +settings. This study contributes a systematic evaluation of self- and semi- +methods with a unified experimental protocol intended to guide a practitioner +with scarce overall labeled data and a limited compute budget. We answer two +key questions: Can hyperparameter tuning be effective with realistic-sized +validation sets? If so, when all methods are tuned well, which self- or +semi-supervised methods reach the best accuracy? Our study compares 13 +representative semi- and self-supervised methods to strong labeled-set-only +baselines on 4 medical datasets. From 20000+ total GPU hours of computation, we +provide valuable best practices to resource-constrained, results-focused +practitioners.",cs.CV,"['cs.CV', 'cs.LG']" +Grounded Text-to-Image Synthesis with Attention Refocusing,Quynh Phung · Songwei Ge · Jia-Bin Huang, ,https://arxiv.org/abs/2306.05427,,2306.05427.pdf,Grounded Text-to-Image Synthesis with Attention Refocusing,"Driven by the scalable diffusion models trained on large-scale datasets, +text-to-image synthesis methods have shown compelling results. However, these +models still fail to precisely follow the text prompt involving multiple +objects, attributes, or spatial compositions. In this paper, we reveal the +potential causes in the diffusion model's cross-attention and self-attention +layers. We propose two novel losses to refocus attention maps according to a +given spatial layout during sampling. Creating the layouts manually requires +additional effort and can be tedious. Therefore, we explore using large +language models (LLM) to produce these layouts for our method. We conduct +extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate +our proposed method. 
We show that our proposed attention refocusing effectively +improves the controllability of existing approaches.",cs.CV,['cs.CV'] +"Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention, Alignment and Prompt Tuning",Leslie Ching Ow Tiong · Dick Sigmund · Chen-Hui Chan · Andrew Beng Jin Teoh,https://github.com/MIS-DevWorks/FBR,,https://mdpi-res.com/d_attachment/sensors/sensors-23-06006/article_deploy/sensors-23-06006.pdf?version=1687952937,,,,,nan +Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias,Wenyu Zhang · Qingmu Liu · Felix Ong · Mohamed Ragab · Chuan-Sheng Foo, ,https://arxiv.org/abs/2403.11234,,2403.11234.pdf,Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias,"Domain adaptation is a critical task in machine learning that aims to improve +model performance on a target domain by leveraging knowledge from a related +source domain. In this work, we introduce Universal Semi-Supervised Domain +Adaptation (UniSSDA), a practical yet challenging setting where the target +domain is partially labeled, and the source and target label space may not +strictly match. UniSSDA is at the intersection of Universal Domain Adaptation +(UniDA) and Semi-Supervised Domain Adaptation (SSDA): the UniDA setting does +not allow for fine-grained categorization of target private classes not +represented in the source domain, while SSDA focuses on the restricted +closed-set setting where source and target label spaces match exactly. Existing +UniDA and SSDA methods are susceptible to common-class bias in UniSSDA +settings, where models overfit to data distributions of classes common to both +domains at the expense of private classes. We propose a new prior-guided +pseudo-label refinement strategy to reduce the reinforcement of common-class +bias due to pseudo-labeling, a common label propagation strategy in domain +adaptation. We demonstrate the effectiveness of the proposed strategy on +benchmark datasets Office-Home, DomainNet, and VisDA. The proposed strategy +attains the best performance across UniSSDA adaptation settings and establishes +a new baseline for UniSSDA.",cs.CV,['cs.CV'] +OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,Bohao Peng · Xiaoyang Wu · Li Jiang · Yukang Chen · Hengshuang Zhao · Zhuotao Tian · Jiaya Jia, ,https://arxiv.org/abs/2403.14418,,2403.14418.pdf,OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,"The booming of 3D recognition in the 2020s began with the introduction of +point cloud transformers. They quickly overwhelmed sparse CNNs and became +state-of-the-art models, especially in 3D semantic segmentation. However, +sparse CNNs are still valuable networks, due to their efficiency treasure, and +ease of application. In this work, we reexamine the design distinctions and +test the limits of what a sparse CNN can achieve. We discover that the key +credit to the performance difference is adaptivity. Specifically, we propose +two key components, i.e., adaptive receptive fields (spatially) and adaptive +relation, to bridge the gap. This exploration led to the creation of +Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a +lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal +computational cost. Without any self-attention modules, OA-CNNs favorably +surpass point transformers in terms of accuracy in both indoor and outdoor +scenes, with much less latency and memory cost. 
Notably, it achieves 76.1%, +78.9%, and 70.6% mIoU on ScanNet v2, nuScenes, and SemanticKITTI validation +benchmarks respectively, while maintaining at most 5x better speed than +transformer counterparts. This revelation highlights the potential of pure +sparse CNNs to outperform transformer-related networks.",cs.CV,['cs.CV'] +Unified Language-driven Zero-shot Domain Adaptation,Senqiao Yang · Zhuotao Tian · Li Jiang · Jiaya Jia, ,https://arxiv.org/abs/2404.07155,,2404.07155.pdf,Unified Language-driven Zero-shot Domain Adaptation,"This paper introduces Unified Language-driven Zero-shot Domain Adaptation +(ULDA), a novel task setting that enables a single model to adapt to diverse +target domains without explicit domain-ID knowledge. We identify the +constraints in the existing language-driven zero-shot domain adaptation task, +particularly the requirement for domain IDs and domain-specific models, which +may restrict flexibility and scalability. To overcome these issues, we propose +a new framework for ULDA, consisting of Hierarchical Context Alignment (HCA), +Domain Consistent Representation Learning (DCRL), and Text-Driven Rectifier +(TDR). These components work synergistically to align simulated features with +target text across multiple visual levels, retain semantic correlations between +different regional representations, and rectify biases between simulated and +real target visual features, respectively. Our extensive empirical evaluations +demonstrate that this framework achieves competitive performance in both +settings, surpassing even the model that requires domain-ID, showcasing its +superiority and generalization ability. The proposed method is not only +effective but also maintains practicality and efficiency, as it does not +introduce additional computational costs during inference. Our project page is +https://senqiaoyang.com/project/ULDA .",cs.CV,['cs.CV'] +Delving into the Trajectory Long-tail Distribution for Muti-object Tracking,Sijia Chen · En Yu · Jinyang Li · Wenbing Tao, ,https://arxiv.org/abs/2403.04700,,2403.04700.pdf,Delving into the Trajectory Long-tail Distribution for Muti-object Tracking,"Multiple Object Tracking (MOT) is a critical area within computer vision, +with a broad spectrum of practical implementations. Current research has +primarily focused on the development of tracking algorithms and enhancement of +post-processing techniques. Yet, there has been a lack of thorough examination +concerning the nature of tracking data it self. In this study, we pioneer an +exploration into the distribution patterns of tracking data and identify a +pronounced long-tail distribution issue within existing MOT datasets. We note a +significant imbalance in the distribution of trajectory lengths across +different pedestrians, a phenomenon we refer to as ``pedestrians trajectory +long-tail distribution''. Addressing this challenge, we introduce a bespoke +strategy designed to mitigate the effects of this skewed distribution. +Specifically, we propose two data augmentation strategies, including Stationary +Camera View Data Augmentation (SVA) and Dynamic Camera View Data Augmentation +(DVA) , designed for viewpoint states and the Group Softmax (GS) module for +Re-ID. SVA is to backtrack and predict the pedestrian trajectory of tail +classes, and DVA is to use diffusion model to change the background of the +scene. GS divides the pedestrians into unrelated groups and performs softmax +operation on each group individually. 
Our proposed strategies can be integrated +into numerous existing tracking systems, and extensive experimentation +validates the efficacy of our method in reducing the influence of long-tail +distribution on multi-object tracking performance. The code is available at +https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT.",cs.CV,['cs.CV'] +HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes,Yichen Yao · Zimo Jiang · YUJING SUN · Zhencai Zhu · Xinge Zhu · Runnan Chen · Yuexin Ma, ,https://arxiv.org/abs/2403.02769,,2403.02769.pdf,HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes,"Human-centric 3D scene understanding has recently drawn increasing attention, +driven by its critical impact on robotics. However, human-centric real-life +scenarios are extremely diverse and complicated, and humans have intricate +motions and interactions. With limited labeled data, supervised methods are +difficult to generalize to general scenarios, hindering real-life applications. +Mimicking human intelligence, we propose an unsupervised 3D detection method +for human-centric scenarios by transferring the knowledge from synthetic human +instances to real scenes. To bridge the gap between the distinct data +representations and feature distributions of synthetic models and real point +clouds, we introduce novel modules for effective instance-to-scene +representation transfer and synthetic-to-real feature alignment. Remarkably, +our method exhibits superior performance compared to current state-of-the-art +techniques, achieving 87.8% improvement in mAP and closely approaching the +performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on HuCenLife +Dataset.",cs.CV,['cs.CV'] +VAREN: Very Accurate and Realistic Equine Network,Silvia Zuffi · Ylva Mellbin · Ci Li · Markus Höschle · Hedvig Kjellström · Senya Polikovsky · Elin Hernlund · Michael J. Black,https://varen.is.tue.mpg.de/,,https://www.kth.se/is/rpl/rpl-news/accepted-publications-march-1.1339092,,,,,nan +Mask Grounding for Referring Image Segmentation,Yong Xien Chng · Henry Zheng · Yizeng Han · Xuchong QIU · Gao Huang,https://yxchng.github.io/projects/mask-grounding/,https://arxiv.org/abs/2312.12198,,2312.12198.pdf,Mask Grounding for Referring Image Segmentation,"Referring Image Segmentation (RIS) is a challenging task that requires an +algorithm to segment objects referred by free-form language expressions. +Despite significant progress in recent years, most state-of-the-art (SOTA) +methods still suffer from considerable language-image modality gap at the pixel +and word level. These methods generally 1) rely on sentence-level language +features for language-image alignment and 2) lack explicit training supervision +for fine-grained visual grounding. Consequently, they exhibit weak object-level +correspondence between visual and language features. Without well-grounded +features, prior methods struggle to understand complex expressions that require +strong reasoning over relationships among multiple objects, especially when +dealing with rarely used or ambiguous clauses. To tackle this challenge, we +introduce a novel Mask Grounding auxiliary task that significantly improves +visual grounding within language features, by explicitly teaching the model to +learn fine-grained correspondence between masked textual tokens and their +matching visual objects. 
Mask Grounding can be directly used on prior RIS +methods and consistently bring improvements. Furthermore, to holistically +address the modality gap, we also design a cross-modal alignment loss and an +accompanying alignment module. These additions work synergistically with Mask +Grounding. With all these techniques, our comprehensive approach culminates in +MagNet (Mask-grounded Network), an architecture that significantly outperforms +prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating +our method's effectiveness in addressing current limitations of RIS algorithms. +Our code and pre-trained weights will be released.",cs.CV,['cs.CV'] +MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model,Kaiyu Song · Hanjiang Lai · Yan Pan · Jian Yin, ,https://arxiv.org/abs/2312.04802,,2312.04802.pdf,MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model,"Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where +an imperceptible perturbation is added to the image that can fool the DNNs. +Diffusion-based adversarial purification focuses on using the diffusion model +to generate a clean image against such adversarial attacks. Unfortunately, the +generative process of the diffusion model is also inevitably affected by +adversarial perturbation since the diffusion model is also a deep network where +its input has adversarial perturbation. In this work, we propose +MimicDiffusion, a new diffusion-based adversarial purification technique, that +directly approximates the generative process of the diffusion model with the +clean image as input. Concretely, we analyze the differences between the guided +terms using the clean image and the adversarial sample. After that, we first +implement MimicDiffusion based on Manhattan distance. Then, we propose two +guidance to purify the adversarial perturbation and approximate the clean +diffusion model. Extensive experiments on three image datasets including +CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including +WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that +MimicDiffusion significantly performs better than the state-of-the-art +baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%, +and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\% +higher, respectively. The code is available in the supplementary material.",cs.CV,['cs.CV'] +Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing,Hyelin Nam · Gihyun Kwon · Geon Yeong Park · Jong Chul Ye,https://hyelinnam.github.io/CDS/,https://arxiv.org/abs/2311.18608,,2311.18608.pdf,Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing,"With the remarkable advent of text-to-image diffusion models, image editing +methods have become more diverse and continue to evolve. A promising recent +approach in this realm is Delta Denoising Score (DDS) - an image editing +technique based on Score Distillation Sampling (SDS) framework that leverages +the rich generative prior of text-to-image diffusion models. However, relying +solely on the difference between scoring functions is insufficient for +preserving specific structural elements from the original image, a crucial +aspect of image editing. To address this, here we present an embarrassingly +simple yet very powerful modification of DDS, called Contrastive Denoising +Score (CDS), for latent diffusion models (LDM). 
Inspired by the similarities +and differences between DDS and the contrastive learning for unpaired +image-to-image translation(CUT), we introduce a straightforward approach using +CUT loss within the DDS framework. Rather than employing auxiliary networks as +in the original CUT approach, we leverage the intermediate features of LDM, +specifically those from the self-attention layers, which possesses rich spatial +information. Our approach enables zero-shot image-to-image translation and +neural radiance field (NeRF) editing, achieving structural correspondence +between the input and output while maintaining content controllability. +Qualitative results and comparisons demonstrates the effectiveness of our +proposed method. Project page: https://hyelinnam.github.io/CDS/",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling,Jiawei Shi · Hui Deng · Yuchao Dai,https://npucvr.github.io/TSM-NRSfM/,https://arxiv.org/abs/2405.04309,,2405.04309.pdf,Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling,"Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively +studied and great progress has been made, there are still key challenges that +hinder their broad real-world applications: 1) the inherent motion/rotation +ambiguity requires either explicit camera motion recovery with extra constraint +or complex Procrustean Alignment; 2) existing low-rank modeling of the global +shape can over-penalize drastic deformations in the 3D shape sequence. This +paper proposes to resolve the above issues from a spatial-temporal modeling +perspective. First, we propose a novel Temporally-smooth Procrustean Alignment +module that estimates 3D deforming shapes and adjusts the camera motion by +aligning the 3D shape sequence consecutively. Our new alignment module remedies +the requirement of complex reference 3D shape during alignment, which is more +conductive to non-isotropic deformation modeling. Second, we propose a +spatial-weighted approach to enforce the low-rank constraint adaptively at +different locations to accommodate drastic spatially-variant deformation +reconstruction better. Our modeling outperform existing low-rank based methods, +and extensive experiments across different datasets validate the effectiveness +of our method.",cs.CV,['cs.CV'] +Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization,Yujia Liu · Chenxi Yang · Dingquan Li · Jianhao Ding · Tingting Jiang, ,https://arxiv.org/abs/2403.11397,,2403.11397.pdf,Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization,"The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the +quality score of an input image without additional information. NR-IQA models +play a crucial role in the media industry, aiding in performance evaluation and +optimization guidance. However, these models are found to be vulnerable to +adversarial attacks, which introduce imperceptible perturbations to input +images, resulting in significant changes in predicted scores. In this paper, we +propose a defense method to improve the stability in predicted scores when +attacked by small perturbations, thus enhancing the adversarial robustness of +NR-IQA models. 
To be specific, we present theoretical evidence showing that the +magnitude of score changes is related to the $\ell_1$ norm of the model's +gradient with respect to the input image. Building upon this theoretical +foundation, we propose a norm regularization training strategy aimed at +reducing the $\ell_1$ norm of the gradient, thereby boosting the robustness of +NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate +the effectiveness of our strategy in reducing score changes in the presence of +adversarial attacks. To the best of our knowledge, this work marks the first +attempt to defend against adversarial attacks on NR-IQA models. Our study +offers valuable insights into the adversarial robustness of NR-IQA models and +provides a foundation for future research in this area.",cs.CV,"['cs.CV', 'eess.IV']" +A Unified Framework for Human-centric Point Cloud Video Understanding,Yiteng Xu · Kecheng Ye · xiao han · yiming ren · Xinge Zhu · Yuexin Ma, ,https://arxiv.org/abs/2403.20031,,2403.20031.pdf,A Unified Framework for Human-centric Point Cloud Video Understanding,"Human-centric Point Cloud Video Understanding (PVU) is an emerging field +focused on extracting and interpreting human-related features from sequences of +human point clouds, further advancing downstream human-centric tasks and +applications. Previous works usually focus on tackling one specific task and +rely on huge labeled data, which has poor generalization capability. +Considering that human has specific characteristics, including the structural +semantics of human body and the dynamics of human motions, we propose a unified +framework to make full use of the prior knowledge and explore the inherent +features in the data itself for generalized human-centric point cloud video +understanding. Extensive experiments demonstrate that our method achieves +state-of-the-art performance on various human-related tasks, including action +recognition and 3D pose estimation. All datasets and code will be released +soon.",cs.CV,['cs.CV'] +iKUN: Speak to Trackers without Retraining,Yunhao Du · Cheng Lei · Zhicheng Zhao · Fei Su,https://github.com/dyhBUPT/iKUN,https://arxiv.org/abs/2312.16245,,2312.16245.pdf,iKUN: Speak to Trackers without Retraining,"Referring multi-object tracking (RMOT) aims to track multiple objects based +on input textual descriptions. Previous works realize it by simply integrating +an extra textual module into the multi-object tracker. However, they typically +need to retrain the entire framework and have difficulties in optimization. In +this work, we propose an insertable Knowledge Unification Network, termed iKUN, +to enable communication with off-the-shelf trackers in a plug-and-play manner. +Concretely, a knowledge unification module (KUM) is designed to adaptively +extract visual features based on textual guidance. Meanwhile, to improve the +localization accuracy, we present a neural version of Kalman filter (NKF) to +dynamically adjust process noise and observation noise based on the current +motion status. Moreover, to address the problem of open-set long-tail +distribution of textual descriptions, a test-time similarity calibration method +is proposed to refine the confidence score with pseudo frequency. Extensive +experiments on Refer-KITTI dataset verify the effectiveness of our framework. +Finally, to speed up the development of RMOT, we also contribute a more +challenging dataset, Refer-Dance, by extending public DanceTrack dataset with +motion and dressing descriptions. 
The codes and dataset are available at +https://github.com/dyhBUPT/iKUN.",cs.CV,['cs.CV'] +Accelerating Diffusion Sampling with Optimized Time Steps,Shuchen Xue · Zhaoqiang Liu · Fei Chen · Shifeng Zhang · Tianyang Hu · Enze Xie · Zhenguo Li, ,https://arxiv.org/abs/2402.17376,,2402.17376.pdf,Accelerating Diffusion Sampling with Optimized Time Steps,"Diffusion probabilistic models (DPMs) have shown remarkable performance in +high-resolution image synthesis, but their sampling efficiency is still to be +desired due to the typically large number of sampling steps. Recent +advancements in high-order numerical ODE solvers for DPMs have enabled the +generation of high-quality images with much fewer sampling steps. While this is +a significant development, most sampling methods still employ uniform time +steps, which is not optimal when using a small number of steps. To address this +issue, we propose a general framework for designing an optimization problem +that seeks more appropriate time steps for a specific numerical ODE solver for +DPMs. This optimization problem aims to minimize the distance between the +ground-truth solution to the ODE and an approximate solution corresponding to +the numerical solver. It can be efficiently solved using the constrained trust +region method, taking less than $15$ seconds. Our extensive experiments on both +unconditional and conditional sampling using pixel- and latent-space DPMs +demonstrate that, when combined with the state-of-the-art sampling method +UniPC, our optimized time steps significantly improve image generation +performance in terms of FID scores for datasets such as CIFAR-10 and ImageNet, +compared to using uniform time steps.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DemoFusion: Democratising High-Resolution Image Generation With No $$$,Ruoyi DU · Dongliang Chang · Timothy Hospedales · Yi-Zhe Song · Zhanyu Ma,https://ruoyidu.github.io/demofusion/demofusion.html,https://arxiv.org/abs/2311.16973,,2311.16973.pdf,DemoFusion: Democratising High-Resolution Image Generation With No $$$,"High-resolution image generation with Generative Artificial Intelligence +(GenAI) has immense potential but, due to the enormous capital investment +required for training, it is increasingly centralised to a few large +corporations, and hidden behind paywalls. This paper aims to democratise +high-resolution GenAI by advancing the frontier of high-resolution generation +while remaining accessible to a broad audience. We demonstrate that existing +Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution +image generation. Our novel DemoFusion framework seamlessly extends open-source +GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated +Sampling mechanisms to achieve higher-resolution image generation. 
The +progressive nature of DemoFusion requires more passes, but the intermediate +results can serve as ""previews"", facilitating rapid prompt iteration.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images.,WEI SHAO · YangYang Shi · Daoqiang Zhang · Junjie Zhou · Peng Wan, ,,https://www.nature.com/articles/s41467-023-40890-x,,,,,nan +Adapting Short-Term Transformers for Action Detection in Untrimmed Videos,Min Yang · gaohuan · Ping Guo · Limin Wang, ,https://arxiv.org/abs/2312.01897,,2312.01897.pdf,Adapting Short-Term Transformers for Action Detection in Untrimmed Videos,"Vision Transformer (ViT) has shown high potential in video recognition, owing +to its flexible design, adaptable self-attention mechanisms, and the efficacy +of masked pre-training. Yet, it remains unclear how to adapt these pre-trained +short-term ViTs for temporal action detection (TAD) in untrimmed videos. The +existing works treat them as off-the-shelf feature extractors for each +short-trimmed snippet without capturing the fine-grained relation among +different snippets in a broader temporal context. To mitigate this issue, this +paper focuses on designing a new mechanism for adapting these pre-trained ViT +models as a unified long-form video transformer to fully unleash its modeling +power in capturing inter-snippet relation, while still keeping low computation +overhead and memory consumption for efficient TAD. To this end, we design +effective cross-snippet propagation modules to gradually exchange short-term +video information among different snippets from two levels. For inner-backbone +information propagation, we introduce a cross-snippet propagation strategy to +enable multi-snippet temporal feature interaction inside the backbone.For +post-backbone information propagation, we propose temporal transformer layers +for further clip-level modeling. With the plain ViT-B pre-trained with +VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields a very +competitive performance to previous temporal action detectors, riching up to +69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3 and 17.20 +average mAP on FineAction.",cs.CV,['cs.CV'] +LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment,yiming ren · xiao han · Chengfeng Zhao · Jingya Wang · Lan Xu · Jingyi Yu · Yuexin Ma, ,https://arxiv.org/abs/2402.17171,,2402.17171.pdf,LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment,"For human-centric large-scale scenes, fine-grained modeling for 3D human +global pose and shape is significant for scene understanding and can benefit +many real-world applications. In this paper, we present LiveHPS, a novel +single-LiDAR-based approach for scene-level human pose and shape estimation +without any limitation of light conditions and wearable devices. In particular, +we design a distillation mechanism to mitigate the distribution-varying effect +of LiDAR point clouds and exploit the temporal-spatial geometric and dynamic +information existing in consecutive frames to solve the occlusion and noise +disturbance. LiveHPS, with its efficient configuration and high-quality output, +is well-suited for real-world applications. Moreover, we propose a huge human +motion dataset, named FreeMotion, which is collected in various scenarios with +diverse human poses, shapes and translations. 
It consists of multi-modal and +multi-view acquisition data from calibrated and synchronized LiDARs, cameras, +and IMUs. Extensive experiments on our new dataset and other public datasets +demonstrate the SOTA performance and robustness of our approach. We will +release our code and dataset soon.",cs.CV,['cs.CV'] +CoDe: An Explicit Content Decoupling Framework for Image Restoration,Enxuan Gu · Hongwei Ge · Yong Guo, ,https://arxiv.org/abs/2312.05006,,2312.05006.pdf,Decoupling Degradation and Content Processing for Adverse Weather Image Restoration,"Adverse weather image restoration strives to recover clear images from those +affected by various weather types, such as rain, haze, and snow. Each weather +type calls for a tailored degradation removal approach due to its unique impact +on images. Conversely, content reconstruction can employ a uniform approach, as +the underlying image content remains consistent. Although previous techniques +can handle multiple weather types within a single network, they neglect the +crucial distinction between these two processes, limiting the quality of +restored images. This work introduces a novel adverse weather image restoration +method, called DDCNet, which decouples the degradation removal and content +reconstruction process at the feature level based on their channel statistics. +Specifically, we exploit the unique advantages of the Fourier transform in both +these two processes: (1) the degradation information is mainly located in the +amplitude component of the Fourier domain, and (2) the Fourier domain contains +global information. The former facilitates channel-dependent degradation +removal operation, allowing the network to tailor responses to various adverse +weather types; the latter, by integrating Fourier's global properties into +channel-independent content features, enhances network capacity for consistent +global content reconstruction. We further augment the degradation removal +process with a degradation mapping loss function. Extensive experiments +demonstrate our method achieves state-of-the-art performance in multiple +adverse weather removal benchmarks.",cs.CV,['cs.CV'] +MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation,Xiaolong Deng · Huisi Wu · Runhao Zeng · Jing Qin,https://github.com/dengxl0520/MemSAM,https://arxiv.org/abs/2311.10529,,2311.10529.pdf,Enhancing the Reliability of Segment Anything Model for Auto-Prompting Medical Image Segmentation with Uncertainty Rectification,"The Segment Anything Model (SAM) has recently emerged as a groundbreaking +foundation model for prompt-driven image segmentation tasks. However, both the +original SAM and its medical variants require slice-by-slice manual prompting +of target structures, which directly increase the burden for applications. +Despite attempts of auto-prompting to turn SAM into a fully automatic manner, +it still exhibits subpar performance and lacks of reliability especially in the +field of medical imaging. In this paper, we propose UR-SAM, an uncertainty +rectified SAM framework to enhance the reliability for auto-prompting medical +image segmentation. Building upon a localization framework for automatic prompt +generation, our method incorporates a prompt augmentation module to obtain a +series of input prompts for SAM for uncertainty estimation and an +uncertainty-based rectification module to further utilize the distribution of +estimated uncertainty to improve the segmentation performance. 
Extensive +experiments on two public 3D medical datasets covering the segmentation of 35 +organs demonstrate that without supplementary training or fine-tuning, our +method further improves the segmentation performance with up to 10.7 % and 13.8 +% in dice similarity coefficient, demonstrating efficiency and broad +capabilities for medical image segmentation without manual prompting.",cs.CV,['cs.CV'] +Generative Powers of Ten,Xiaojuan Wang · Janne Kontkanen · Brian Curless · Steve Seitz · Ira Kemelmacher-Shlizerman · Ben Mildenhall · Pratul P. Srinivasan · Dor Verbin · Aleksander Holynski, ,https://arxiv.org/abs/2312.02149,,2312.02149.pdf,Generative Powers of Ten,"We present a method that uses a text-to-image model to generate consistent +content across multiple image scales, enabling extreme semantic zooms into a +scene, e.g., ranging from a wide-angle landscape view of a forest to a macro +shot of an insect sitting on one of the tree branches. We achieve this through +a joint multi-scale diffusion sampling approach that encourages consistency +across different scales while preserving the integrity of each individual +sampling process. Since each generated scale is guided by a different text +prompt, our method enables deeper levels of zoom than traditional +super-resolution methods that may struggle to create new contextual structure +at vastly different scales. We compare our method qualitatively with +alternative techniques in image super-resolution and outpainting, and show that +our method is most effective at generating consistent multi-scale content.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.GR']" +When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach,TAO MA · Bing Bai · Haozhe Lin · Heyuan Wang · Yu Wang · Lin Luo · Lu Fang, ,https://arxiv.org/abs/2307.11558,,2307.11558.pdf,Advancing Visual Grounding with Scene Knowledge: Benchmark and Method,"Visual grounding (VG) aims to establish fine-grained alignment between vision +and language. Ideally, it can be a testbed for vision-and-language models to +evaluate their understanding of the images and texts and their reasoning +abilities over their joint space. However, most existing VG datasets are +constructed using simple description texts, which do not require sufficient +reasoning over the images and texts. This has been demonstrated in a recent +study~\cite{luo2022goes}, where a simple LSTM-based text encoder without +pretraining can achieve state-of-the-art performance on mainstream VG datasets. +Therefore, in this paper, we propose a novel benchmark of \underline{S}cene +\underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG), +where the image content and referring expressions are not sufficient to ground +the target objects, forcing the models to have a reasoning ability on the +long-form scene knowledge. To perform this task, we propose two approaches to +accept the triple-type input, where the former embeds knowledge into the image +features before the image-query interaction; the latter leverages linguistic +structure to assist in computing the image-text matching. We conduct extensive +experiments to analyze the above methods and show that the proposed approaches +achieve promising results but still leave room for improvement, including +performance and interpretability. 
The dataset and code are available at +\url{https://github.com/zhjohnchan/SK-VG}.",cs.CV,"['cs.CV', 'cs.CL']" +Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization,Jimyeong Kim · Jungwon Park · Wonjong Rhee, ,https://arxiv.org/abs/2403.15330,,2403.15330.pdf,Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization,"In text-to-image personalization, a timely and crucial challenge is the +tendency of generated images overfitting to the biases present in the reference +images. We initiate our study with a comprehensive categorization of the biases +into background, nearby-object, tied-object, substance (in style +re-contextualization), and pose biases. These biases manifest in the generated +images due to their entanglement into the subject embedding. This undesired +embedding entanglement not only results in the reflection of biases from the +reference images into the generated images but also notably diminishes the +alignment of the generated images with the given generation prompt. To address +this challenge, we propose SID~(Selectively Informative Description), a text +description strategy that deviates from the prevalent approach of only +characterizing the subject's class identification. SID is generated utilizing +multimodal GPT-4 and can be seamlessly integrated into optimization-based +models. We present comprehensive experimental results along with analyses of +cross-attention maps, subject-alignment, non-subject-disentanglement, and +text-alignment.",cs.CV,['cs.CV'] +SFOD: Spiking Fusion Object Detector,Yimeng Fan · Wei Zhang · Changsong Liu · Mingyang Li · Wenrui Lu,https://github.com/yimeng-fan/SFOD,https://arxiv.org/abs/2403.15192,,2403.15192.pdf,SFOD: Spiking Fusion Object Detector,"Event cameras, characterized by high temporal resolution, high dynamic range, +low power consumption, and high pixel bandwidth, offer unique capabilities for +object detection in specialized contexts. Despite these advantages, the +inherent sparsity and asynchrony of event data pose challenges to existing +object detection algorithms. Spiking Neural Networks (SNNs), inspired by the +way the human brain codes and processes information, offer a potential solution +to these difficulties. However, their performance in object detection using +event cameras is limited in current implementations. In this paper, we propose +the Spiking Fusion Object Detector (SFOD), a simple and efficient approach to +SNN-based object detection. Specifically, we design a Spiking Fusion Module, +achieving the first-time fusion of feature maps from different scales in SNNs +applied to event cameras. Additionally, through integrating our analysis and +experiments conducted during the pretraining of the backbone network on the +NCAR dataset, we delve deeply into the impact of spiking decoding strategies +and loss functions on model performance. Thereby, we establish state-of-the-art +classification results based on SNNs, achieving 93.7\% accuracy on the NCAR +dataset. Experimental results on the GEN1 detection dataset demonstrate that +the SFOD achieves a state-of-the-art mAP of 32.1\%, outperforming existing +SNN-based approaches. Our research not only underscores the potential of SNNs +in object detection with event cameras but also propels the advancement of +SNNs. 
Code is available at https://github.com/yimeng-fan/SFOD.",cs.CV,"['cs.CV', 'cs.AI']" +A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives,Simone Alberto Peirone · Francesca Pistilli · Antonio Alliegro · Giuseppe Averta,https://sapeirone.github.io/EgoPack/,https://arxiv.org/abs/2403.03037,,2403.03037.pdf,A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives,"Human comprehension of a video stream is naturally broad: in a few instants, +we are able to understand what is happening, the relevance and relationship of +objects, and forecast what will follow in the near future, everything all at +once. We believe that - to effectively transfer such an holistic perception to +intelligent machines - an important role is played by learning to correlate +concepts and to abstract knowledge coming from different tasks, to +synergistically exploit them when learning novel skills. To accomplish this, we +seek for a unified approach to video understanding which combines shared +temporal modelling of human actions with minimal overhead, to support multiple +downstream tasks and enable cooperation when learning novel skills. We then +propose EgoPack, a solution that creates a collection of task perspectives that +can be carried across downstream tasks and used as a potential source of +additional insights, as a backpack of skills that a robot can carry around and +use when needed. We demonstrate the effectiveness and efficiency of our +approach on four Ego4D benchmarks, outperforming current state-of-the-art +methods.",cs.CV,"['cs.CV', 'cs.LG']" +How Far Can We Compress Instant NGP-Based NeRF?,Yihang Chen · Qianyi Wu · Mehrtash Harandi · Jianfei Cai, ,https://arxiv.org/abs/2310.14695,,2310.14695.pdf,CAwa-NeRF: Instant Learning of Compression-Aware NeRF Features,"Modeling 3D scenes by volumetric feature grids is one of the promising +directions of neural approximations to improve Neural Radiance Fields (NeRF). +Instant-NGP (INGP) introduced multi-resolution hash encoding from a lookup +table of trainable feature grids which enabled learning high-quality neural +graphics primitives in a matter of seconds. However, this improvement came at +the cost of higher storage size. In this paper, we address this challenge by +introducing instant learning of compression-aware NeRF features (CAwa-NeRF), +that allows exporting the zip compressed feature grids at the end of the model +training with a negligible extra time overhead without changing neither the +storage architecture nor the parameters used in the original INGP paper. +Nonetheless, the proposed method is not limited to INGP but could also be +adapted to any model. By means of extensive simulations, our proposed instant +learning pipeline can achieve impressive results on different kinds of static +scenes such as single object masked background scenes and real-life scenes +captured in our studio. 
In particular, for single object masked background +scenes CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the +original size without any loss in the PSNR (33 dB) or down to 2.4% (0.53 MB) +with a slight virtual loss (32.31 dB).",cs.CV,['cs.CV'] +Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners,Yazhou Xing · Yingqing He · Zeyue Tian · Xintao Wang · Qifeng Chen, ,https://arxiv.org/abs/2402.17723,,2402.17723.pdf,Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners,"Video and audio content creation serves as the core technique for the movie +industry and professional users. Recently, existing diffusion-based methods +tackle video and audio generation separately, which hinders the technique +transfer from academia to industry. In this work, we aim at filling the gap, +with a carefully designed optimization-based framework for cross-visual-audio +and joint-visual-audio generation. We observe the powerful generation ability +of off-the-shelf video or audio generation models. Thus, instead of training +the giant models from scratch, we propose to bridge the existing strong models +with a shared latent representation space. Specifically, we propose a +multimodality latent aligner with the pre-trained ImageBind model. Our latent +aligner shares a similar core as the classifier guidance that guides the +diffusion denoising process during inference time. Through carefully designed +optimization strategy and loss functions, we show the superior performance of +our method on joint video-audio generation, visual-steered audio generation, +and audio-steered visual generation tasks. The project website can be found at +https://yzxing87.github.io/Seeing-and-Hearing/",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +VidToMe: Video Token Merging for Zero-Shot Video Editing,Xirui Li · Chao Ma · Xiaokang Yang · Ming-Hsuan Yang,https://vidtome-diffusion.github.io/,https://arxiv.org/abs/2312.10656,,2312.10656.pdf,VidToMe: Video Token Merging for Zero-Shot Video Editing,"Diffusion models have made significant advances in generating high-quality +images, but their application to video generation has remained challenging due +to the complexity of temporal motion. Zero-shot video editing offers a solution +by utilizing pre-trained image diffusion models to translate source videos into +new ones. Nevertheless, existing methods struggle to maintain strict temporal +consistency and efficient memory consumption. In this work, we propose a novel +approach to enhance temporal consistency in generated videos by merging +self-attention tokens across frames. By aligning and compressing temporally +redundant tokens across frames, our method improves temporal coherence and +reduces memory consumption in self-attention computations. The merging strategy +matches and aligns tokens according to the temporal correspondence between +frames, facilitating natural temporal consistency in generated video frames. To +manage the complexity of video processing, we divide videos into chunks and +develop intra-chunk local token merging and inter-chunk global token merging, +ensuring both short-term video continuity and long-term content consistency. 
+Our video editing approach seamlessly extends the advancements in image editing +to video editing, rendering favorable results in temporal consistency over +state-of-the-art methods.",cs.CV,['cs.CV'] +Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling,Ziwen Li · Feng Zhang · Meng Cao · Jinpu Zhang · Yuanjie Shao · Yuehuan Wang · Nong Sang,https://github.com/HUST-IAL/CoTF,,https://www.semanticscholar.org/paper/An-Efficient-Method-for-Real-Time-Image-Exposure-Yang-Zhang/b40baf5034dcc98f06f53abe907b9ac0395e2bb2,,,,,nan +Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering,Zhiwen Yan · Weng Fei Low · Yu Chen · Gim Hee Lee, ,https://arxiv.org/abs/2311.17089,,2311.17089.pdf,Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering,"3D Gaussians have recently emerged as a highly efficient representation for +3D reconstruction and rendering. Despite its high rendering quality and speed +at high resolutions, they both deteriorate drastically when rendered at lower +resolutions or from far away camera position. During low resolution or far away +rendering, the pixel size of the image can fall below the Nyquist frequency +compared to the screen size of each splatted 3D Gaussian and leads to aliasing +effect. The rendering is also drastically slowed down by the sequential alpha +blending of more splatted Gaussians per pixel. To address these issues, we +propose a multi-scale 3D Gaussian splatting algorithm, which maintains +Gaussians at different scales to represent the same scene. Higher-resolution +images are rendered with more small Gaussians, and lower-resolution images are +rendered with fewer larger Gaussians. With similar training time, our algorithm +can achieve 13\%-66\% PSNR and 160\%-2400\% rendering speed improvement at +4$\times$-128$\times$ scale rendering on Mip-NeRF360 dataset compared to the +single scale 3D Gaussian splitting. Our code and more results are available on +our project website https://jokeryan.github.io/projects/ms-gs/",cs.CV,['cs.CV'] +Construct to Associate: Cooperative Context Learning for Domain Adaptive Point Cloud Segmentation,Guangrui Li, ,,https://ieeexplore.ieee.org/document/10330760,,,,,nan +Holistic Features are almost Sufficient for Text-to-Video Retrieval,Kaibin Tian · Ruixiang Zhao · Zijie Xin · Bangxiang Lan · Xirong Li,https://github.com/ruc-aimc-lab/TeachCLIP,,https://lixirong.net/research/cvpr2024-holistic-features-are-almost-sufficient-for-text-to-video-retrieval,,,,,nan +TE-TAD: Towards Fully End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression,Ho-Joong Kim · Jung-Ho Hong · Heejo Kong · Seong-Whan Lee, ,https://arxiv.org/abs/2404.02405,,2404.02405.pdf,TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression,"In this paper, we investigate that the normalized coordinate expression is a +key factor as reliance on hand-crafted components in query-based detectors for +temporal action detection (TAD). Despite significant advancements towards an +end-to-end framework in object detection, query-based detectors have been +limited in achieving full end-to-end modeling in TAD. To address this issue, we +propose \modelname{}, a full end-to-end temporal action detection transformer +that integrates time-aligned coordinate expression. We reformulate coordinate +expression utilizing actual timeline values, ensuring length-invariant +representations from the extremely diverse video duration environment. 
+Furthermore, our proposed adaptive query selection dynamically adjusts the +number of queries based on video length, providing a suitable solution for +varying video durations compared to a fixed query set. Our approach not only +simplifies the TAD process by eliminating the need for hand-crafted components +but also significantly improves the performance of query-based detectors. Our +TE-TAD outperforms the previous query-based detectors and achieves competitive +performance compared to state-of-the-art methods on popular benchmark datasets. +Code is available at: https://github.com/Dotori-HJ/TE-TAD",cs.CV,['cs.CV'] +Dual Prototype Attention for Unsupervised Video Object Segmentation,Suhwan Cho · Minhyeok Lee · Seunghoon Lee · Dogyoon Lee · Heeseung Choi · Ig-Jae Kim · Sangyoun Lee, ,https://arxiv.org/abs/2309.14786,,2309.14786.pdf,Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation,"Unsupervised video object segmentation (VOS) is a task that aims to detect +the most salient object in a video without external guidance about the object. +To leverage the property that salient objects usually have distinctive +movements compared to the background, recent methods collaboratively use motion +cues extracted from optical flow maps with appearance cues extracted from RGB +images. However, as optical flow maps are usually very relevant to segmentation +masks, the network is easy to be learned overly dependent on the motion cues +during network training. As a result, such two-stream approaches are vulnerable +to confusing motion cues, making their prediction unstable. To relieve this +issue, we design a novel motion-as-option network by treating motion cues as +optional. During network training, RGB images are randomly provided to the +motion encoder instead of optical flow maps, to implicitly reduce motion +dependency of the network. As the learned motion encoder can deal with both RGB +images and optical flow maps, two different predictions can be generated +depending on which source information is used as motion input. In order to +fully exploit this property, we also propose an adaptive output selection +algorithm to adopt optimal prediction result at test time. Our proposed +approach affords state-of-the-art performance on all public benchmark datasets, +even maintaining real-time inference speed.",cs.CV,['cs.CV'] +Adaptive Slot Attention: Object Discovery with Dynamic Slot Number,Ke Fan · Zechen Bai · Tianjun Xiao · Tong He · Max Horn · Yanwei Fu · Francesco Locatello · Zheng Zhang, ,https://arxiv.org/abs/2307.09437,,2307.09437.pdf,Grounded Object Centric Learning,"The extraction of modular object-centric representations for downstream tasks +is an emerging area of research. Learning grounded representations of objects +that are guaranteed to be stable and invariant promises robust performance +across different tasks and environments. Slot Attention (SA) learns +object-centric representations by assigning objects to \textit{slots}, but +presupposes a \textit{single} distribution from which all slots are randomly +initialised. This results in an inability to learn \textit{specialized} slots +which bind to specific object types and remain invariant to identity-preserving +changes in object appearance. To address this, we present +\emph{\textsc{Co}nditional \textsc{S}lot \textsc{A}ttention} (\textsc{CoSA}) +using a novel concept of \emph{Grounded Slot Dictionary} (GSD) inspired by +vector quantization. 
Our proposed GSD comprises (i) canonical object-level +property vectors and (ii) parametric Gaussian distributions, which define a +prior over the slots. We demonstrate the benefits of our method in multiple +downstream tasks such as scene generation, composition, and task adaptation, +whilst remaining competitive with SA in popular object discovery benchmarks.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras,Ashwath Shetty · Marc Habermann · Guoxing Sun · Diogo Luvizon · Vladislav Golyanik · Christian Theobalt, ,https://arxiv.org/abs/2312.07423,,2312.07423.pdf,Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras,"We present the first approach to render highly realistic free-viewpoint +videos of a human actor in general apparel, from sparse multi-view recording to +display, in real-time at an unprecedented 4K resolution. At inference, our +method only requires four camera views of the moving actor and the respective +3D skeletal pose. It handles actors in wide clothing, and reproduces even +fine-scale dynamic detail, e.g. clothing wrinkles, face expressions, and hand +gestures. At training time, our learning-based approach expects dense +multi-view video and a rigged static surface scan of the actor. Our method +comprises three main stages. Stage 1 is a skeleton-driven neural approach for +high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel +solution to create a view-dependent texture using four test-time camera views +as input. Finally, stage 3 comprises a new image-based refinement network +rendering the final 4K image given the output from the previous stages. Our +approach establishes a new benchmark for real-time rendering resolution and +quality using sparse input camera views, unlocking possibilities for immersive +telepresence.",cs.CV,['cs.CV'] +Seeing the World through Your Eyes,Hadi Alzayer · Kevin Zhang · Brandon Y. Feng · Christopher Metzler · Jia-Bin Huang, ,https://arxiv.org/abs/2306.09348,,2306.09348.pdf,Seeing the World through Your Eyes,"The reflective nature of the human eye is an underappreciated source of +information about what the world around us looks like. By imaging the eyes of a +moving person, we can collect multiple views of a scene outside the camera's +direct line of sight through the reflections in the eyes. In this paper, we +reconstruct a 3D scene beyond the camera's line of sight using portrait images +containing eye reflections. This task is challenging due to 1) the difficulty +of accurately estimating eye poses and 2) the entangled appearance of the eye +iris and the scene reflections. Our method jointly refines the cornea poses, +the radiance field depicting the scene, and the observer's eye iris texture. We +further propose a simple regularization prior on the iris texture pattern to +improve reconstruction quality. 
Through various experiments on synthetic and +real-world captures featuring people with varied eye colors, we demonstrate the +feasibility of our approach to recover 3D scenes using eye reflections.",cs.CV,['cs.CV'] +NeRF Analogies - Example-Based Visual Attribute Transfer for NeRFs,Michael Fischer · Zhengqin Li · Thu Nguyen-Phuoc · Aljaž Božič · Zhao Dong · Carl Marshall · Tobias Ritschel, ,https://arxiv.org/abs/2402.08622,,2402.08622.pdf,NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs,"A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry +and appearance of a scene. We here ask the question whether we can transfer the +appearance from a source NeRF onto a target 3D geometry in a semantically +meaningful way, such that the resulting new NeRF retains the target geometry +but has an appearance that is an analogy to the source NeRF. To this end, we +generalize classic image analogies from 2D images to NeRFs. We leverage +correspondence transfer along semantic affinity that is driven by semantic +features from large, pre-trained 2D image models to achieve multi-view +consistent appearance transfer. Our method allows exploring the mix-and-match +product space of 3D geometry and appearance. We show that our method +outperforms traditional stylization-based methods and that a large majority of +users prefer our method over several typical baselines.",cs.CV,"['cs.CV', 'cs.GR']" +MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers,Haoyu Ma · Shahin Mahdizadehaghdam · Bichen Wu · Zhipeng Fan · Yuchao Gu · Wenliang Zhao · Lior Shapira · Xiaohui Xie, ,https://arxiv.org/abs/2312.12468,,2312.12468.pdf,MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers,"Recent advances in generative AI have significantly enhanced image and video +editing, particularly in the context of text prompt control. State-of-the-art +approaches predominantly rely on diffusion models to accomplish these tasks. +However, the computational demands of diffusion-based methods are substantial, +often necessitating large-scale paired datasets for training, and therefore +challenging their deployment in real applications. To address these issues, this +paper breaks down the text-based video editing task into two stages. First, we +leverage a pre-trained text-to-image diffusion model to simultaneously edit a +few keyframes in a zero-shot way. Second, we introduce an efficient model +called MaskINT, which is built on non-autoregressive masked generative +transformers and specializes in frame interpolation between the edited +keyframes, using the structural guidance from intermediate frames. Experimental +results suggest that our MaskINT achieves comparable performance with +diffusion-based methodologies, while significantly improving the inference time. 
+This research offers a practical solution for text-based video editing and +showcases the potential of non-autoregressive masked generative transformers in +this domain.",cs.CV,['cs.CV'] +Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing,ChangHee Yang · ChanHee Kang · Kyeongbo Kong · Hanni Oh · Suk-Ju Kang,https://yangchanghee.github.io/Person-in-Place_page/,,https://vds.sogang.ac.kr/?cat=5,,,,,nan +Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis,Mingyang Zhao · Jiang Jingen · Lei Ma · Shiqing Xin · Gaofeng Meng · Dong-Ming Yan, ,,https://link.springer.com/article/10.1007/s11042-023-16854-0,,,,,nan +Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion,Litu Rout · Yujia Chen · Abhishek Kumar · Constantine Caramanis · Sanjay Shakkottai · Wen-Sheng Chu,https://stsl-inverse-edit.github.io/,https://arxiv.org/abs/2312.00852,,2312.00852.pdf,Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion,"Sampling from the posterior distribution poses a major computational +challenge in solving inverse problems using latent diffusion models. Common +methods rely on Tweedie's first-order moments, which are known to induce a +quality-limiting bias. Existing second-order approximations are impractical due +to prohibitive computational costs, making standard reverse diffusion processes +intractable for posterior sampling. This paper introduces Second-order Tweedie +sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency +comparable to first-order Tweedie with a tractable reverse process using +second-order approximation. Our theoretical results reveal that the +second-order approximation is lower bounded by our surrogate loss that only +requires $O(1)$ compute using the trace of the Hessian, and by the lower bound +we derive a new drift term to make the reverse process tractable. Our method +surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural +function evaluations, respectively, while notably enhancing sampling quality on +FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to +text-guided image editing and addresses residual distortions present from +corrupted images in leading text-guided image editing methods. To our best +knowledge, this is the first work to offer an efficient second-order +approximation in solving inverse problems using latent diffusion and editing +real-world images with corruptions.",cs.LG,"['cs.LG', 'cs.CV', 'stat.ML']" +Human Motion Prediction under Unexpected Perturbation,Jiangbei Yue · Baiyi Li · Julien Pettré · Armin Seyfried · He Wang, ,https://arxiv.org/abs/2403.15891,,2403.15891.pdf,Human Motion Prediction under Unexpected Perturbation,"We investigate a new task in human motion prediction, which is predicting +motions under unexpected physical perturbation potentially involving multiple +people. Compared with existing research, this task involves predicting less +controlled, unpremeditated and pure reactive motions in response to external +impact and how such motions can propagate through people. It brings new +challenges such as data scarcity and predicting complex interactions. To this +end, we propose a new method capitalizing differential physics and deep neural +networks, leading to an explicit Latent Differential Physics (LDP) model. 
+Through experiments, we demonstrate that LDP has high data efficiency, +outstanding prediction accuracy, strong generalizability and good +explainability. Since there is no similar research, a comprehensive comparison +with 11 adapted baselines from several relevant domains is conducted, showing +LDP outperforming existing research both quantitatively and qualitatively, +improving prediction accuracy by as much as 70%, and demonstrating +significantly stronger generalization.",cs.CV,['cs.CV'] +TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation,Xiaopei Wu · Yuenan Hou · Xiaoshui Huang · Binbin Lin · Tong He · Xinge Zhu · Yuexin Ma · Boxi Wu · Haifeng Liu · Deng Cai · Wanli Ouyang, ,https://arxiv.org/html/2309.07849v3,,2309.07849v3.pdf,TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation,"LiDAR semantic segmentation plays a crucial role in enabling autonomous +driving and robots to understand their surroundings accurately and robustly. A +multitude of methods exist within this domain, including point-based, +range-image-based, polar-coordinate-based, and hybrid strategies. Among these, +range-image-based techniques have gained widespread adoption in practical +applications due to their efficiency. However, they face a significant +challenge known as the ``many-to-one'' problem caused by the range image's +limited horizontal and vertical angular resolution. As a result, around 20% of +the 3D points can be occluded. In this paper, we present TFNet, a +range-image-based LiDAR semantic segmentation method that utilizes temporal +information to address this issue. Specifically, we incorporate a temporal +fusion layer to extract useful information from previous scans and integrate it +with the current scan. We then design a max-voting-based post-processing +technique to correct false predictions, particularly those caused by the +``many-to-one'' issue. We evaluated the approach on two benchmarks and +demonstrated that the plug-in post-processing technique is generic and can be +applied to various networks.",cs.CV,['cs.CV'] +ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning,Beomyoung Kim · Joonsang Yu · Sung Ju Hwang,https://github.com/clovaai/ECLIPSE,https://arxiv.org/abs/2403.20126,,2403.20126.pdf,ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning,"Panoptic segmentation, combining semantic and instance segmentation, stands +as a cutting-edge computer vision task. Despite recent progress with deep +learning models, the dynamic nature of real-world applications necessitates +continual learning, where models adapt to new classes (plasticity) over time +without forgetting old ones (catastrophic forgetting). Current continual +segmentation methods often rely on distillation strategies like knowledge +distillation and pseudo-labeling, which are effective but result in increased +training complexity and computational overhead. In this paper, we introduce a +novel and efficient method for continual panoptic segmentation based on Visual +Prompt Tuning, dubbed ECLIPSE. Our approach involves freezing the base model +parameters and fine-tuning only a small set of prompt embeddings, addressing +both catastrophic forgetting and plasticity and significantly reducing the +trainable parameters. To mitigate inherent challenges such as error propagation +and semantic drift in continual segmentation, we propose logit manipulation to +effectively leverage common knowledge across the classes. 
Experiments on ADE20K +continual panoptic segmentation benchmark demonstrate the superiority of +ECLIPSE, notably its robustness against catastrophic forgetting and its +reasonable plasticity, achieving a new state-of-the-art. The code is available +at https://github.com/clovaai/ECLIPSE.",cs.CV,['cs.CV'] +LEOD: Label-Efficient Object Detection for Event Cameras,Ziyi Wu · Mathias Gehrig · Qing Lyu · Xudong Liu · Igor Gilitschenski,https://github.com/Wuziyi616/LEOD,https://arxiv.org/abs/2311.17286,,2311.17286.pdf,LEOD: Label-Efficient Object Detection for Event Cameras,"Object detection with event cameras benefits from the sensor's low latency +and high dynamic range. However, it is costly to fully label event streams for +supervised training due to their high temporal resolution. To reduce this cost, +we present LEOD, the first method for label-efficient event-based detection. +Our approach unifies weakly- and semi-supervised object detection with a +self-training mechanism. We first utilize a detector pre-trained on limited +labels to produce pseudo ground truth on unlabeled events. Then, the detector +is re-trained with both real and generated labels. Leveraging the temporal +consistency of events, we run bi-directional inference and apply tracking-based +post-processing to enhance the quality of pseudo labels. To stabilize training +against label noise, we further design a soft anchor assignment strategy. We +introduce new experimental protocols to evaluate the task of label-efficient +event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms +supervised baselines across various labeling ratios. For example, on Gen1, it +improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, +RVT-S with 10% labels even surpasses its fully-supervised counterpart using +100% labels. LEOD maintains its effectiveness even when all labeled data are +available, reaching new state-of-the-art results. Finally, we show that our +method readily scales to improve larger detectors as well. Code is released at +https://github.com/Wuziyi616/LEOD",cs.CV,['cs.CV'] +Adapters Strike Back,Jan-Martin Steitz · Stefan Roth, ,,https://strikefans.com/the-ink-black-heart-has-wrapped/,,,,,nan +CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras,Sachin Shah · Matthew Chan · Haoming Cai · Jingxi Chen · Sakshum Kulshrestha · Chahat Deep Singh · Yiannis Aloimonos · Christopher Metzler, ,https://arxiv.org/abs/2404.11511,,2404.11511.pdf,"Event Cameras Meet SPADs for High-Speed, Low-Bandwidth Imaging","Traditional cameras face a trade-off between low-light performance and +high-speed imaging: longer exposure times to capture sufficient light results +in motion blur, whereas shorter exposures result in Poisson-corrupted noisy +images. While burst photography techniques help mitigate this tradeoff, +conventional cameras are fundamentally limited in their sensor noise +characteristics. Event cameras and single-photon avalanche diode (SPAD) sensors +have emerged as promising alternatives to conventional cameras due to their +desirable properties. SPADs are capable of single-photon sensitivity with +microsecond temporal resolution, and event cameras can measure brightness +changes up to 1 MHz with low bandwidth requirements. We show that these +properties are complementary, and can help achieve low-light, high-speed image +reconstruction with low bandwidth requirements. 
We introduce a sensor fusion +framework to combine SPADs with event cameras to improve the reconstruction of +high-speed, low-light scenes while reducing the high bandwidth cost associated +with using every SPAD frame. Our evaluation, on both synthetic and real sensor +data, demonstrates significant enhancements ( > 5 dB PSNR) in reconstructing +low-light scenes at high temporal resolution (100 kHz) compared to conventional +cameras. Event-SPAD fusion shows great promise for real-world applications, +such as robotics or medical imaging.",eess.IV,"['eess.IV', 'cs.CV']" +Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians,Yuelang Xu · Benwang Chen · Zhe Li · Hongwen Zhang · Lizhen Wang · Zerong Zheng · Yebin Liu,https://yuelangx.github.io/gaussianheadavatar,https://arxiv.org/abs/2312.03029,,2312.03029.pdf,Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians,"Creating high-fidelity 3D head avatars has always been a research hotspot, +but there remains a great challenge under lightweight sparse view setups. In +this paper, we propose Gaussian Head Avatar represented by controllable 3D +Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D +Gaussians and a fully learned MLP-based deformation field to capture complex +expressions. The two parts benefit each other, so that our method can model +fine-grained dynamic details while ensuring expression accuracy. Furthermore, +we devise a well-designed geometry-guided initialization strategy based on +implicit SDF and Deep Marching Tetrahedra for the stability and convergence of +the training procedure. Experiments show our approach outperforms other +state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering +quality at 2K resolution even under exaggerated expressions.",cs.CV,"['cs.CV', 'cs.GR']" +Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis,Yuchao Gu · Xintao Wang · Yixiao Ge · Ying Shan · Mike Zheng Shou, ,https://ar5iv.labs.arxiv.org/html/2310.01218,,2310.01218.pdf,Making LLaMA SEE and Draw with SEED Tokenizer,"The great success of Large Language Models (LLMs) has expanded the potential +of multimodality, contributing to the gradual evolution of General Artificial +Intelligence (AGI). A true AGI agent should not only possess the capability to +perform predefined multi-tasks but also exhibit emergent abilities in an +open-world context. However, despite the considerable advancements made by +recent multimodal LLMs, they still fall short in effectively unifying +comprehension and generation tasks, let alone open-world emergent abilities. We +contend that the key to overcoming the present impasse lies in enabling text +and images to be represented and processed interchangeably within a unified +autoregressive Transformer. To this end, we introduce SEED, an elaborate image +tokenizer that empowers LLMs with the ability to SEE and Draw at the same time. +We identify two crucial design principles: (1) Image tokens should be +independent of 2D physical patch positions and instead be produced with a 1D +causal dependency, exhibiting intrinsic interdependence that aligns with the +left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens +should capture high-level semantics consistent with the degree of semantic +abstraction in words, and be optimized for both discriminativeness and +reconstruction during the tokenizer training phase. 
With SEED tokens, LLM is +able to perform scalable multimodal autoregression under its original training +recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by +large-scale pretraining and instruction tuning on the interleaved textual and +visual data, demonstrating impressive performance on a broad range of +multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has +exhibited compositional emergent abilities such as multi-turn in-context +multimodal generation, acting like your AI assistant.",cs.CV,['cs.CV'] +HDRFlow: Real-Time HDR Video Reconstruction with Large Motions,Gangwei Xu · Yujin Wang · Jinwei Gu · Tianfan Xue · Xin Yang, ,https://arxiv.org/abs/2403.03447,,2403.03447.pdf,HDRFlow: Real-Time HDR Video Reconstruction with Large Motions,"Reconstructing High Dynamic Range (HDR) video from image sequences captured +with alternating exposures is challenging, especially in the presence of large +camera or object motion. Existing methods typically align low dynamic range +sequences using optical flow or attention mechanism for deghosting. However, +they often struggle to handle large complex motions and are computationally +expensive. To address these challenges, we propose a robust and efficient flow +estimator tailored for real-time HDR video reconstruction, named HDRFlow. +HDRFlow has three novel designs: an HDR-domain alignment loss (HALoss), an +efficient flow network with a multi-size large kernel (MLK), and a new HDR flow +training scheme. The HALoss supervises our flow network to learn an +HDR-oriented flow for accurate alignment in saturated and dark regions. The MLK +can effectively model large motions at a negligible cost. In addition, we +incorporate synthetic data, Sintel, into our training dataset, utilizing both +its provided forward flow and backward flow generated by us to supervise our +flow network, enhancing our performance in large motion regions. Extensive +experiments demonstrate that our HDRFlow outperforms previous methods on +standard benchmarks. To the best of our knowledge, HDRFlow is the first +real-time HDR video reconstruction method for video sequences captured with +alternating exposures, capable of processing 720p resolution inputs at 25ms.",cs.CV,['cs.CV'] +LiSA: LiDAR Localization with Semantic Awareness,Bochun Yang · Zijun Li · Wen Li · zhipeng cai · Chenglu Wen · Yu Zang · Matthias Mueller · Cheng Wang, ,https://arxiv.org/abs/2402.18934,,2402.18934.pdf,RELEAD: Resilient Localization with Enhanced LiDAR Odometry in Adverse Environments,"LiDAR-based localization is valuable for applications like mining surveys and +underground facility maintenance. However, existing methods can struggle when +dealing with uninformative geometric structures in challenging scenarios. This +paper presents RELEAD, a LiDAR-centric solution designed to address +scan-matching degradation. Our method enables degeneracy-free point cloud +registration by solving constrained ESIKF updates in the front end and +incorporates multisensor constraints, even when dealing with outlier +measurements, through graph optimization based on Graduated Non-Convexity +(GNC). Additionally, we propose a robust Incremental Fixed Lag Smoother (rIFL) +for efficient GNC-based optimization. 
RELEAD has undergone extensive evaluation +in degenerate scenarios and has outperformed existing state-of-the-art +LiDAR-Inertial odometry and LiDAR-Visual-Inertial odometry methods.",cs.RO,['cs.RO'] +Language Models as Black-Box Optimizers for Vision-Language Models,Shihong Liu · Samuel Yu · Zhiqiu Lin · Deepak Pathak · Deva Ramanan,https://llm-can-optimize-vlm.github.io/,https://arxiv.org/abs/2309.05950,,2309.05950.pdf,Language Models as Black-Box Optimizers for Vision-Language Models,"Vision-language models (VLMs) pre-trained on web-scale datasets have +demonstrated remarkable capabilities on downstream tasks when fine-tuned with +minimal data. However, many VLMs rely on proprietary data and are not +open-source, which restricts the use of white-box approaches for fine-tuning. +As such, we aim to develop a black-box approach to optimize VLMs through +natural language prompts, thereby avoiding the need to access model parameters, +feature embeddings, or even output logits. We propose employing chat-based LLMs +to search for the best text prompt for VLMs. Specifically, we adopt an +automatic hill-climbing procedure that converges to an effective prompt by +evaluating the performance of current prompts and asking LLMs to refine them +based on textual feedback, all within a conversational process without +human-in-the-loop. In a challenging 1-shot image classification setup, our +simple approach surpasses the white-box continuous prompting method (CoOp) by +an average of 1.5% across 11 datasets including ImageNet. Our approach also +outperforms both human-engineered and LLM-generated prompts. We highlight the +advantage of conversational feedback that incorporates both positive and +negative prompts, suggesting that LLMs can utilize the implicit gradient +direction in textual feedback for a more efficient search. In addition, we find +that the text prompts generated through our strategy are not only more +interpretable but also transfer well across different VLM architectures in a +black-box manner. Lastly, we apply our framework to optimize the +state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt +inversion, and personalization.",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG', 'cs.MM']" +The Neglected Tails of Vision-Language Models,Shubham Parashar · Tian Liu · Zhiqiu Lin · Xiangjue Dong · Yanan Li · James Caverlee · Deva Ramanan · Shu Kong,https://shubhamprshr27.github.io/neglected-tails-of-vlms/,https://arxiv.org/abs/2401.12425,,2401.12425.pdf,The Neglected Tails in Vision-Language Models,"Vision-language models (VLMs) excel in zero-shot recognition but their +performance varies greatly across different visual concepts. For example, +although CLIP achieves impressive accuracy on ImageNet (60-80%), its +performance drops below 10% for more than ten concepts like night snake, +presumably due to their limited presence in the pretraining data. However, +measuring the frequency of concepts in VLMs' large-scale datasets is +challenging. We address this by using large language models (LLMs) to count the +number of pretraining texts that contain synonyms of these concepts. Our +analysis confirms that popular datasets, such as LAION, exhibit a long-tailed +concept distribution, yielding biased performance in VLMs. We also find that +downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and +text-to-image models (e.g., Stable Diffusion), often fail to recognize or +generate images of rare concepts identified by our method. 
To mitigate the +imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented +Learning (REAL). First, instead of prompting VLMs using the original class +names, REAL uses their most frequent synonyms found in pretraining texts. This +simple change already outperforms costly human-engineered and LLM-enriched +prompts over nine benchmark datasets. Second, REAL trains a linear classifier +on a small yet balanced set of pretraining data retrieved using concept +synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage +and 10,000x less training time!",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data,Chengxiang Fan · Muzhi Zhu · Hao Chen · Yang Liu · Weijia Wu · Huaqi Zhang · Chunhua Shen,https://github.com/aim-uofa/DiverGen,https://arxiv.org/abs/2405.10185,,2405.10185.pdf,DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data,"Instance segmentation is data-hungry, and as model capacity increases, data +scale becomes crucial for improving the accuracy. Most instance segmentation +datasets today require costly manual annotation, limiting their data scale. +Models trained on such data are prone to overfitting on the training set, +especially for those rare categories. While recent works have delved into +exploiting generative models to create synthetic datasets for data +augmentation, these approaches do not efficiently harness the full potential of +generative models. + To address these issues, we introduce a more efficient strategy to construct +generative datasets for data augmentation, termed DiverGen. Firstly, we provide +an explanation of the role of generative data from the perspective of +distribution discrepancy. We investigate the impact of different data on the +distribution learned by the model. We argue that generative data can expand the +data distribution that the model can learn, thus mitigating overfitting. +Additionally, we find that the diversity of generative data is crucial for +improving model performance and enhance it through various strategies, +including category diversity, prompt diversity, and generative model diversity. +With these strategies, we can scale the data to millions while maintaining the +trend of model performance improvement. On the LVIS dataset, DiverGen +significantly outperforms the strong model X-Paste, achieving +1.1 box AP and ++1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare +categories.",cs.CV,['cs.CV'] +Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension,Quan Liu · Hongzi Zhu · Zhenxi Wang · Yunsong Zhou · Shan Chang · Minyi Guo, ,https://arxiv.org/abs/2403.03532,,2403.03532.pdf,Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension,"Registration of point clouds collected from a pair of distant vehicles +provides a comprehensive and accurate 3D view of the driving scenario, which is +vital for driving safety related applications, yet existing literature suffers +from the expensive pose label acquisition and the deficiency to generalize to +new data distributions. In this paper, we propose EYOC, an unsupervised distant +point cloud registration method that adapts to new point cloud distributions on +the fly, requiring no global pose labels. 
The core idea of EYOC is to train a +feature extractor in a progressive fashion, where in each round, the feature +extractor, trained with near point cloud pairs, can label slightly farther +point cloud pairs, enabling self-supervision on such far point cloud pairs. +This process continues until the derived extractor can be used to register +distant point clouds. Particularly, to enable high-fidelity correspondence +label generation, we devise an effective spatial filtering scheme to select the +most representative correspondences to register a point cloud pair, and then +utilize the aligned point clouds to discover more correct correspondences. +Experiments show that EYOC can achieve comparable performance with +state-of-the-art supervised methods at a lower training cost. Moreover, it +outwits supervised methods regarding generalization performance on new data +distributions.",cs.CV,['cs.CV'] +Fine-grained Bipartite Concept Factorization for Clustering,Chong Peng · Pengfei Zhang · Yongyong Chen · zhao kang · Chenglizhao Chen · Qiang Cheng, ,,https://ieeexplore.ieee.org/document/10506642,,,,,nan +SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration,Xu Cao · Takafumi Taketomi, ,https://arxiv.org/abs/2312.04803,,2312.04803.pdf,SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration,"We present SuperNormal, a fast, high-fidelity approach to multi-view 3D +reconstruction using surface normal maps. With a few minutes, SuperNormal +produces detailed surfaces on par with 3D scanners. We harness volume rendering +to optimize a neural signed distance function (SDF) powered by multi-resolution +hash encoding. To accelerate training, we propose directional finite difference +and patch-based ray marching to approximate the SDF gradients numerically. +While not compromising reconstruction quality, this strategy is nearly twice as +efficient as analytical gradients and about three times faster than +axis-aligned finite difference. Experiments on the benchmark dataset +demonstrate the superiority of SuperNormal in efficiency and accuracy compared +to existing multi-view photometric stereo methods. On our captured objects, +SuperNormal produces more fine-grained geometry than recent neural 3D +reconstruction methods.",cs.CV,['cs.CV'] +SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers,Jonathan F. Carter · Joao Jorge · Oliver Gibson · Lionel Tarassenko, ,https://arxiv.org/abs/2404.03831,,2404.03831.pdf,SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers,"Advances in camera-based physiological monitoring have enabled the robust, +non-contact measurement of respiration and the cardiac pulse, which are known +to be indicative of the sleep stage. This has led to research into camera-based +sleep monitoring as a promising alternative to ""gold-standard"" polysomnography, +which is cumbersome, expensive to administer, and hence unsuitable for +longer-term clinical studies. In this paper, we introduce SleepVST, a +transformer model which enables state-of-the-art performance in camera-based +sleep stage classification (sleep staging). After pre-training on contact +sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep +staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of +0.75 and 0.77 respectively. 
We then show that SleepVST can be successfully +transferred to cardio-respiratory waveforms extracted from video, enabling +fully contact-free sleep staging. Using a video dataset of 50 nights, we +achieve a total accuracy of 78.8\% and a Cohen's $\kappa$ of 0.71 in four-class +video-based sleep staging, setting a new state-of-the-art in the domain.",cs.CV,"['cs.CV', 'cs.HC', 'q-bio.NC']" +Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos,Yuhan Shen · Ehsan Elhamifar, ,https://arxiv.org/abs/2404.01933,,,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +Efficient Solution of Point-Line Absolute Pose,Petr Hruby · Timothy Duff · Marc Pollefeys,https://github.com/petrhruby97/efficient_absolute,https://arxiv.org/abs/2404.16552,,2404.16552.pdf,Efficient Solution of Point-Line Absolute Pose,"We revisit certain problems of pose estimation based on 3D--2D +correspondences between features which may be points or lines. Specifically, we +address the two previously-studied minimal problems of estimating camera +extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ +line--line correspondences. To the best of our knowledge, all of the +previously-known practical solutions to these problems required computing the +roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ +polynomials when $p=1.$ We describe and implement two elementary solutions +which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ +to $4$, respectively. We show experimentally that the resulting solvers are +numerically stable and fast: when compared to the previous state-of-the art, we +may obtain nearly an order of magnitude speedup. The code is available at +\url{https://github.com/petrhruby97/efficient\_absolute}",cs.CV,"['cs.CV', '68T45', 'I.4.5']" +ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,Tz-Ying Wu · Chih-Hui Ho · Nuno Vasconcelos,http://www.svcl.ucsd.edu/projects/protect/,https://arxiv.org/abs/2306.02240,,2306.02240.pdf,ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,"Visual-language foundation models, like CLIP, learn generalized +representations that enable zero-shot open-set classification. 
Few-shot +adaptation methods, based on prompt tuning, have been shown to further improve +performance on downstream datasets. However, these methods do not fare well in +the taxonomic open set (TOS) setting, where the classifier is asked to make +predictions from label sets across different levels of semantic granularity. +Frequently, they infer incorrect labels at coarser taxonomic class levels, even +when the inference at the leaf level (original class labels) is correct. To +address this problem, we propose a prompt tuning technique that calibrates the +hierarchical consistency of model predictions. A set of metrics of hierarchical +consistency, the Hierarchical Consistent Accuracy (HCA) and the Mean Treecut +Accuracy (MTA), are first proposed to evaluate TOS model performance. A new +Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed +to calibrate classification across label set granularities. Results show that +ProTeCt can be combined with existing prompt tuning methods to significantly +improve TOS classification without degrading the leaf level classification +performance.",cs.CV,['cs.CV'] +In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing,Yiran Xu · Zhixin Shu · Cameron Smith · Seoung Wug Oh · Jia-Bin Huang,https://in-n-out-3d.github.io/,,https://www.youtube.com/watch?v=JGbLEEANtnI,,,,,nan +On the Faithfulness of Vision Transformer Explanations,Junyi Wu · Weitai Kang · Hao Tang · Yuan Hong · Yan Yan, ,https://arxiv.org/abs/2404.01415,,2404.01415.pdf,On the Faithfulness of Vision Transformer Explanations,"To interpret Vision Transformers, post-hoc explanations assign salience +scores to input pixels, providing human-understandable heatmaps. However, +whether these interpretations reflect true rationales behind the model's output +is still underexplored. To address this gap, we study the faithfulness +criterion of explanations: the assigned salience scores should represent the +influence of the corresponding input pixels on the model's predictions. To +evaluate faithfulness, we introduce Salience-guided Faithfulness Coefficient +(SaCo), a novel evaluation metric leveraging essential information of salience +distribution. Specifically, we conduct pair-wise comparisons among distinct +pixel groups and then aggregate the differences in their salience scores, +resulting in a coefficient that indicates the explanation's degree of +faithfulness. Our explorations reveal that current metrics struggle to +differentiate between advanced explanation methods and Random Attribution, +thereby failing to capture the faithfulness property. In contrast, our proposed +SaCo offers a reliable faithfulness measurement, establishing a robust metric +for interpretations. Furthermore, our SaCo demonstrates that the use of +gradient and multi-layer aggregation can markedly enhance the faithfulness of +attention-based explanation, shedding light on potential paths for advancing +Vision Transformer explainability.",cs.CV,['cs.CV'] +Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models,Gihyun Kwon · Simon Jenni · Ding Li · Joon-Young Lee · Jong Chul Ye · Fabian Caba Heilbron, ,https://arxiv.org/abs/2404.03913,,2404.03913.pdf,Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models,"While there has been significant progress in customizing text-to-image +generation models, generating images that combine multiple personalized +concepts remains challenging. 
In this work, we introduce Concept Weaver, a +method for composing customized text-to-image diffusion models at inference +time. Specifically, the method breaks the process into two steps: creating a +template image aligned with the semantics of input prompts, and then +personalizing the template using a concept fusion strategy. The fusion strategy +incorporates the appearance of the target concepts into the template image +while retaining its structural details. The results indicate that our method +can generate multiple custom concepts with higher identity fidelity compared to +alternative approaches. Furthermore, the method is shown to seamlessly handle +more than two concepts and closely follow the semantic meaning of the input +prompt without blending appearances across different subjects.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SPAD: Spatially Aware Multiview Diffusers,Yash Kant · Aliaksandr Siarohin · Ziyi Wu · Michael Vasilkovsky · Guocheng Qian · Jian Ren · Riza Alp Guler · Bernard Ghanem · Sergey Tulyakov · Igor Gilitschenski,https://yashkant.github.io/spad,https://arxiv.org/abs/2402.05235,,2402.05235.pdf,SPAD : Spatially Aware Multiview Diffusers,"We present SPAD, a novel approach for creating consistent multi-view images +from text prompts or single images. To enable multi-view generation, we +repurpose a pretrained 2D diffusion model by extending its self-attention +layers with cross-view interactions, and fine-tune it on a high quality subset +of Objaverse. We find that a naive extension of the self-attention proposed in +prior work (e.g. MVDream) leads to content copying between views. Therefore, we +explicitly constrain the cross-view attention based on epipolar geometry. To +further enhance 3D consistency, we utilize Plucker coordinates derived from +camera rays and inject them as positional encoding. This enables SPAD to reason +over spatial proximity in 3D well. In contrast to recent works that can only +generate views at fixed azimuth and elevation, SPAD offers full camera control +and achieves state-of-the-art results in novel view synthesis on unseen objects +from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate +that text-to-3D generation using SPAD prevents the multi-face Janus issue. See +more details at our webpage: https://yashkant.github.io/spad",cs.CV,['cs.CV'] +Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth,Zhaoyang Sun · Shengwu Xiong · Yaxiong Chen · Yi Rong, ,https://arxiv.org/abs/2405.17240,,2405.17240.pdf,Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth,"The absence of real targets to guide the model training is one of the main +problems with the makeup transfer task. Most existing methods tackle this +problem by synthesizing pseudo ground truths (PGTs). However, the generated +PGTs are often sub-optimal and their imprecision will eventually lead to +performance degradation. To alleviate this issue, in this paper, we propose a +novel Content-Style Decoupled Makeup Transfer (CSD-MT) method, which works in a +purely unsupervised manner and thus eliminates the negative effects of +generating PGTs. Specifically, based on the frequency characteristics analysis, +we assume that the low-frequency (LF) component of a face image is more +associated with its makeup style information, while the high-frequency (HF) +component is more related to its content details. 
This assumption allows CSD-MT +to decouple the content and makeup style information in each face image through +the frequency decomposition. After that, CSD-MT realizes makeup transfer by +maximizing the consistency of these two types of information between the +transferred result and input images, respectively. Two newly designed loss +functions are also introduced to further improve the transfer performance. +Extensive quantitative and qualitative analyses show the effectiveness of our +CSD-MT method. Our code is available at +https://github.com/Snowfallingplum/CSD-MT.",cs.CV,['cs.CV'] +SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation,Yamei Chen · Yan Di · Guangyao Zhai · Fabian Manhardt · Chenyangguang Zhang · Ruida Zhang · Federico Tombari · Nassir Navab · Benjamin Busam, ,https://arxiv.org/abs/2311.11125,,2311.11125.pdf,SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation,"Category-level object pose estimation, aiming to predict the 6D pose and 3D +size of objects from known categories, typically struggles with large +intra-class shape variation. Existing works utilizing mean shapes often fall +short of capturing this variation. To address this issue, we present +SecondPose, a novel approach integrating object-specific geometric features +with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 +in providing SE(3)-consistent semantic features, we hierarchically extract two +types of SE(3)-invariant geometric features to further encapsulate +local-to-global object-specific information. These geometric features are then +point-aligned with DINOv2 features to establish a consistent object +representation under SE(3) transformations, facilitating the mapping from +camera space to the pre-defined canonical space, thus further enhancing pose +estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose +achieves a 12.4% leap forward over the state-of-the-art. Moreover, on a more +complex dataset HouseCat6D which provides photometrically challenging objects, +SecondPose still surpasses other competitors by a large margin.",cs.CV,['cs.CV'] +Rethinking FID: Towards a Better Evaluation Metric for Image Generation,Sadeep Jayasumana · Srikumar Ramalingam · Andreas Veit · Daniel Glasner · Ayan Chakrabarti · Sanjiv Kumar, ,https://arxiv.org/abs/2401.09603,,2401.09603.pdf,Rethinking FID: Towards a Better Evaluation Metric for Image Generation,"As with many machine learning problems, the progress of image generation +methods hinges on good evaluation metrics. One of the most popular is the +Frechet Inception Distance (FID). FID estimates the distance between a +distribution of Inception-v3 features of real images, and those of images +generated by the algorithm. We highlight important drawbacks of FID: +Inception's poor representation of the rich and varied content generated by +modern text-to-image models, incorrect normality assumptions, and poor sample +complexity. We call for a reevaluation of FID's use as the primary quality +metric for generated images. We empirically demonstrate that FID contradicts +human raters, it does not reflect gradual improvement of iterative +text-to-image models, it does not capture distortion levels, and that it +produces inconsistent results when varying the sample size. We also propose an +alternative new metric, CMMD, based on richer CLIP embeddings and the maximum +mean discrepancy distance with the Gaussian RBF kernel. 
It is an unbiased +estimator that does not make any assumptions on the probability distribution of +the embeddings and is sample efficient. Through extensive experiments and +analysis, we demonstrate that FID-based evaluations of text-to-image models may +be unreliable, and that CMMD offers a more robust and reliable assessment of +image quality.",cs.CV,['cs.CV'] +CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation,Xi Liu · Ying Guo · Cheng Zhen · Tong Li · Yingying Ao · Pengfei Yan,https://customlistener.github.io/,https://arxiv.org/abs/2403.00274,,2403.00274.pdf,CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation,"Listening head generation aims to synthesize a non-verbal responsive listener +head by modeling the correlation between the speaker and the listener in +dynamic conversion.The applications of listener agent generation in virtual +interaction have promoted many works achieving the diverse and fine-grained +motion generation. However, they can only manipulate motions through simple +emotional labels, but cannot freely control the listener's motions. Since +listener agents should have human-like attributes (e.g. identity, personality) +which can be freely customized by users, this limits their realism. In this +paper, we propose a user-friendly framework called CustomListener to realize +the free-form text prior guided listener generation. To achieve +speaker-listener coordination, we design a Static to Dynamic Portrait module +(SDP), which interacts with speaker information to transform static text into +dynamic portrait token with completion rhythm and amplitude information. To +achieve coherence between segments, we design a Past Guided Generation Module +(PGG) to maintain the consistency of customized listener attributes through the +motion prior, and utilize a diffusion-based structure conditioned on the +portrait token and the motion prior to realize the controllable generation. To +train and evaluate our model, we have constructed two text-annotated listening +head datasets based on ViCo and RealTalk, which provide text-video paired +labels. Extensive experiments have verified the effectiveness of our model.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields,TIANQI LIU · Xinyi Ye · Min Shi · Zihao Huang · Zhiyu Pan · Zhan Peng · Zhiguo Cao, ,https://arxiv.org/abs/2404.17528,,2404.17528.pdf,Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields,"Generalizable NeRF aims to synthesize novel views for unseen scenes. Common +practices involve constructing variance-based cost volumes for geometry +reconstruction and encoding 3D descriptors for decoding novel views. However, +existing methods show limited generalization ability in challenging conditions +due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. +We address these issues point by point. First, we find the variance-based cost +volume exhibits failure patterns as the features of pixels corresponding to the +same point can be inconsistent across different views due to occlusions or +reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to +amplify the contribution of consistent pixel pairs and suppress inconsistent +ones. 
Unlike previous methods that solely fuse 2D features into descriptors, +our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D +context into descriptors through spatial and inter-view interaction. When +decoding the descriptors, we observe the two existing decoding strategies excel +in different areas, which are complementary. A Consistency-Aware Fusion (CAF) +strategy is proposed to leverage the advantages of both. We incorporate the +above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware +Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains +state-of-the-art performance across multiple datasets. Code is available at +https://github.com/TQTQliu/GeFu .",cs.CV,['cs.CV'] +Rethinking Few-shot 3D Point Cloud Semantic Segmentation,Zhaochong An · Guolei Sun · Yun Liu · Fayao Liu · Zongwei Wu · Dan Wang · Luc Van Gool · Serge Belongie, ,https://arxiv.org/abs/2403.00592,,2403.00592.pdf,Rethinking Few-shot 3D Point Cloud Semantic Segmentation,"This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS), +with a focus on two significant issues in the state-of-the-art: foreground +leakage and sparse point distribution. The former arises from non-uniform point +sampling, allowing models to distinguish the density disparities between +foreground and background for easier segmentation. The latter results from +sampling only 2,048 points, limiting semantic information and deviating from +the real-world practice. To address these issues, we introduce a standardized +FS-PCS setting, upon which a new benchmark is built. Moreover, we propose a +novel FS-PCS model. While previous methods are based on feature optimization by +mainly refining support features to enhance prototypes, our method is based on +correlation optimization, referred to as Correlation Optimization Segmentation +(COSeg). Specifically, we compute Class-specific Multi-prototypical Correlation +(CMC) for each query point, representing its correlations to category +prototypes. Then, we propose the Hyper Correlation Augmentation (HCA) module to +enhance CMC. Furthermore, tackling the inherent property of few-shot training +to incur base susceptibility for models, we propose to learn non-parametric +prototypes for the base classes during training. The learned base prototypes +are used to calibrate correlations for the background class through a Base +Prototypes Calibration (BPC) module. Experiments on popular datasets +demonstrate the superiority of COSeg over existing methods. The code is +available at: https://github.com/ZhaochongAn/COSeg",cs.CV,['cs.CV'] +MESA: Matching Everything by Segmenting Anything,Yesheng Zhang · Xu Zhao, ,https://arxiv.org/abs/2401.16741v1,,2401.16741v1.pdf,MESA: Matching Everything by Segmenting Anything,"Feature matching is a crucial task in the field of computer vision, which +involves finding correspondences between images. Previous studies achieve +remarkable performance using learning-based feature comparison. However, the +pervasive presence of matching redundancy between images gives rise to +unnecessary and error-prone computations in these methods, imposing limitations +on their accuracy. To address this issue, we propose MESA, a novel approach to +establish precise area (or region) matches for efficient matching redundancy +reduction. MESA first leverages the advanced image understanding capability of +SAM, a state-of-the-art foundation model for image segmentation, to obtain +image areas with implicit semantic. 
Then, a multi-relational graph is proposed +to model the spatial structure of these areas and construct their scale +hierarchy. Based on graphical models derived from the graph, the area matching +is reformulated as an energy minimization task and effectively resolved. +Extensive experiments demonstrate that MESA yields substantial precision +improvement for multiple point matchers in indoor and outdoor downstream tasks, +e.g. +13.61% for DKM in indoor pose estimation.",cs.CV,['cs.CV'] +DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets,Harsh Rangwani · Pradipto Mondal · Mayank Mishra · Ashish Asokan · R. Venkatesh Babu,https://rangwani-harsh.github.io/DeiT-LT/,https://arxiv.org/abs/2404.02900,,2404.02900.pdf,DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets,"Vision Transformer (ViT) has emerged as a prominent architecture for various +computer vision tasks. In ViT, we divide the input image into patch tokens and +process them through a stack of self attention blocks. However, unlike +Convolutional Neural Networks (CNN), ViTs simple architecture has no +informative inductive bias (e.g., locality,etc. ). Due to this, ViT requires a +large amount of data for pre-training. Various data efficient approaches (DeiT) +have been proposed to train ViT on balanced datasets effectively. However, +limited literature discusses the use of ViT for datasets with long-tailed +imbalances. In this work, we introduce DeiT-LT to tackle the problem of +training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an +efficient and effective way of distillation from CNN via distillation DIST +token by using out-of-distribution images and re-weighting the distillation +loss to enhance focus on tail classes. This leads to the learning of local +CNN-like features in early ViT blocks, improving generalization for tail +classes. Further, to mitigate overfitting, we propose distilling from a flat +CNN teacher, which leads to learning low-rank generalizable features for DIST +tokens across all ViT blocks. With the proposed DeiT-LT scheme, the +distillation DIST token becomes an expert on the tail classes, and the +classifier CLS token becomes an expert on the head classes. The experts help to +effectively learn features corresponding to both the majority and minority +classes using a distinct set of tokens within the same ViT architecture. We +show the effectiveness of DeiT-LT for training ViT from scratch on datasets +ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis,Jiapeng Tang · Yinyu Nie · Lev Markhasin · Angela Dai · Justus Thies · Matthias Nießner,https://tangjiapeng.github.io/projects/DiffuScene/,,https://justusthies.github.io/posts/diffuscene/,,,,,nan +TokenCompose: Text-to-Image Diffusion with Token-level Supervision,Zirui Wang · Zhizhou Sha · Zheng Ding · Yilin Wang · Zhuowen Tu,https://mlpc-ucsd.github.io/TokenCompose/,https://arxiv.org/abs/2312.03626,,2312.03626.pdf,TokenCompose: Grounding Diffusion with Token-level Supervision,"We present TokenCompose, a Latent Diffusion Model for text-to-image +generation that achieves enhanced consistency between user-specified text +prompts and model-generated images. 
Despite its tremendous success, the +standard denoising process in the Latent Diffusion Model takes text prompts as +conditions only, absent explicit constraint for the consistency between the +text prompts and the image contents, leading to unsatisfactory results for +composing multiple object categories. TokenCompose aims to improve +multi-category instance composition by introducing the token-wise consistency +terms between the image content and object segmentation maps in the finetuning +stage. TokenCompose can be applied directly to the existing training pipeline +of text-conditioned diffusion models without extra human labeling information. +By finetuning Stable Diffusion, the model exhibits significant improvements in +multi-category instance composition and enhanced photorealism for its generated +images.",cs.CV,['cs.CV'] +Unbiased Estimator for Distorted Conic in Camera Calibration,Chaehyeon Song · Jaeho Shin · Myung-Hwan Jeon · Jongwoo Lim · Ayoung Kim,https://github.com/chaehyeonsong/discocal,https://arxiv.org/abs/2403.04583,,2403.04583.pdf,Unbiased Estimator for Distorted Conics in Camera Calibration,"In the literature, points and conics have been major features for camera +geometric calibration. Although conics are more informative features than +points, the loss of the conic property under distortion has critically limited +the utility of conic features in camera calibration. Many existing approaches +addressed conic-based calibration by ignoring distortion or introducing 3D +spherical targets to circumvent this limitation. In this paper, we present a +novel formulation for conic-based calibration using moments. Our derivation is +based on the mathematical finding that the first moment can be estimated +without bias even under distortion. This allows us to track moment changes +during projection and distortion, ensuring the preservation of the first moment +of the distorted conic. With an unbiased estimator, the circular patterns can +be accurately detected at the sub-pixel level and can now be fully exploited +for an entire calibration pipeline, resulting in significantly improved +calibration. The entire code is readily available from +https://github.com/ChaehyeonSong/discocal.",cs.CV,['cs.CV'] +Unleashing Channel Potential: Space-Frequency Selection Convolution for SAR Object Detection,Ke Li · Di Wang · Zhangyuan Hu · Wenxuan Zhu · Shaofeng Li · Quan Wang, ,https://arxiv.org/abs/2312.16943,,2312.16943.pdf,Multi-scale direction-aware SAR object detection network via global information fusion,"Deep learning has driven significant progress in object detection using +Synthetic Aperture Radar (SAR) imagery. Existing methods, while achieving +promising results, often struggle to effectively integrate local and global +information, particularly direction-aware features. This paper proposes +SAR-Net, a novel framework specifically designed for global fusion of +direction-aware information in SAR object detection. SAR-Net leverages two key +innovations: the Unity Compensation Mechanism (UCM) and the Direction-aware +Attention Module (DAM). UCM facilitates the establishment of complementary +relationships among features across different scales, enabling efficient global +information fusion and transmission. Additionally, DAM, through bidirectional +attention polymerization, captures direction-aware information, effectively +eliminating background interference. 
Extensive experiments demonstrate the +effectiveness of SAR-Net, achieving state-of-the-art results on aircraft +(SAR-AIRcraft-1.0) and ship datasets (SSDD, HRSID), confirming its +generalization capability and robustness.",cs.CV,['cs.CV'] +FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models,LIn Zhao · Tianchen Zhao · Zinan Lin · Xuefei Ning · Guohao Dai · Huazhong Yang · Yu Wang, ,https://arxiv.org/abs/2403.16379,,2403.16379.pdf,FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models,"In recent years, there has been significant progress in the development of +text-to-image generative models. Evaluating the quality of the generative +models is one essential step in the development process. Unfortunately, the +evaluation process could consume a significant amount of computational +resources, making the required periodic evaluation of model performance (e.g., +monitoring training progress) impractical. Therefore, we seek to improve the +evaluation efficiency by selecting the representative subset of the text-image +dataset. We systematically investigate the design choices, including the +selection criteria (textural features or image-based metrics) and the selection +granularity (prompt-level or set-level). We find that the insights from prior +work on subset selection for training data do not generalize to this problem, +and we propose FlashEval, an iterative search algorithm tailored to evaluation +data selection. We demonstrate the effectiveness of FlashEval on ranking +diffusion models with various configurations, including architectures, +quantization levels, and sampler schedules on COCO and DiffusionDB datasets. +Our searched 50-item subset could achieve comparable evaluation quality to the +randomly sampled 500-item subset for COCO annotations on unseen models, +achieving a 10x evaluation speedup. We release the condensed subset of these +commonly used datasets to help facilitate diffusion algorithm design and +evaluation, and open-source FlashEval as a tool for condensing future datasets, +accessible at https://github.com/thu-nics/FlashEval.",cs.CV,['cs.CV'] +Fair-VPT: Fair Visual Prompt Tuning for Image Classification,Sungho Park · Hyeran Byun, ,https://arxiv.org/abs/2404.05207,,2404.05207.pdf,iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection,"Recent progress has shown great potential of visual prompt tuning (VPT) when +adapting pre-trained vision transformers to various downstream tasks. However, +most existing solutions independently optimize prompts at each layer, thereby +neglecting the usage of task-relevant information encoded in prompt tokens +across layers. Additionally, existing prompt structures are prone to +interference from task-irrelevant noise in input images, which can do harm to +the sharing of task-relevant information. In this paper, we propose a novel VPT +approach, \textbf{iVPT}. It innovatively incorporates a cross-layer dynamic +connection (CDC) for input prompt tokens from adjacent layers, enabling +effective sharing of task-relevant information. Furthermore, we design a +dynamic aggregation (DA) module that facilitates selective sharing of +information between layers. The combination of CDC and DA enhances the +flexibility of the attention process within the VPT framework. 
Building upon +these foundations, iVPT introduces an attentive reinforcement (AR) mechanism, +by automatically identifying salient image tokens, which are further enhanced +by prompt tokens in an additive manner. Extensive experiments on 24 image +classification and semantic segmentation benchmarks clearly demonstrate the +advantage of the proposed iVPT, compared to the state-of-the-art counterparts.",cs.CV,['cs.CV'] +"Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion",Junjiao Tian · Lavisha Aggarwal · Andrea Colaco · Zsolt Kira · Mar Gonzalez-Franco,https://sites.google.com/view/diffseg,https://arxiv.org/abs/2308.12469,,2308.12469.pdf,"Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion","Producing quality segmentation masks for images is a fundamental problem in +computer vision. Recent research has explored large-scale supervised training +to enable zero-shot segmentation on virtually any image style and unsupervised +training to enable segmentation without dense annotations. However, +constructing a model capable of segmenting anything in a zero-shot manner +without any annotations is still challenging. In this paper, we propose to +utilize the self-attention layers in stable diffusion models to achieve this +goal because the pre-trained stable diffusion model has learned inherent +concepts of objects within its attention layers. Specifically, we introduce a +simple yet effective iterative merging process based on measuring KL divergence +among attention maps to merge them into valid segmentation masks. The proposed +method does not require any training or language dependency to extract quality +segmentation for any images. On COCO-Stuff-27, our method surpasses the prior +unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% +in mean IoU. The project page is at +\url{https://sites.google.com/view/diffseg/home}.",cs.CV,['cs.CV'] +DPHMs: Diffusion Parametric Head Models for Depth-based Tracking,Jiapeng Tang · Angela Dai · Yinyu Nie · Lev Markhasin · Justus Thies · Matthias Nießner,https://tangjiapeng.github.io/projects/DPHMs/,https://arxiv.org/abs/2312.01068,,2312.01068.pdf,DPHMs: Diffusion Parametric Head Models for Depth-based Tracking,"We introduce Diffusion Parametric Head Models (DPHMs), a generative model +that enables robust volumetric head reconstruction and tracking from monocular +depth sequences. While recent volumetric head models, such as NPHMs, can now +excel in representing high-fidelity head geometries, tracking and +reconstructing heads from real-world single-view depth sequences remains very +challenging, as the fitting to partial and noisy observations is +underconstrained. To tackle these challenges, we propose a latent +diffusion-based prior to regularize volumetric head reconstruction and +tracking. This prior-based regularizer effectively constrains the identity and +expression codes to lie on the underlying latent manifold which represents +plausible head shapes. To evaluate the effectiveness of the diffusion-based +prior, we collect a dataset of monocular Kinect sequences consisting of various +complex facial expression motions and rapid transitions. 
We compare our method +to state-of-the-art tracking methods and demonstrate improved head identity +reconstruction as well as robust expression tracking.",cs.CV,['cs.CV'] +A Unified Approach for Text- and Image-guided 4D Scene Generation,Yufeng Zheng · Xueting Li · Koki Nagano · Sifei Liu · Otmar Hilliges · Shalini De Mello, ,https://arxiv.org/abs/2311.16854,,2311.16854.pdf,A Unified Approach for Text- and Image-guided 4D Scene Generation,"Large-scale diffusion generative models are greatly simplifying image, video +and 3D asset creation from user-provided text prompts and images. However, the +challenging problem of text-to-4D dynamic 3D scene generation with diffusion +guidance remains largely unexplored. We propose Dream-in-4D, which features a +novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D +diffusion guidance to effectively learn a high-quality static 3D asset in the +first stage; (2) a deformable neural radiance field that explicitly +disentangles the learned static asset from its deformation, preserving quality +during motion learning; and (3) a multi-resolution feature grid for the +deformation field with a displacement total variation loss to effectively learn +motion with video diffusion guidance in the second stage. Through a user +preference study, we demonstrate that our approach significantly advances image +and motion quality, 3D consistency and text fidelity for text-to-4D generation +compared to baseline approaches. Thanks to its motion-disentangled +representation, Dream-in-4D can also be easily adapted for controllable +generation where appearance is defined by one or multiple images, without the +need to modify the motion learning stage. Thus, our method offers, for the +first time, a unified approach for text-to-4D, image-to-4D and personalized 4D +generation tasks.",cs.CV,['cs.CV'] +Continuous Pose for Monocular Cameras in Neural Implicit Representation,Qi Ma · Danda Paudel · Ajad Chhatkuli · Luc Van Gool,https://github.com/qimaqi/Continuous-Pose-in-NeRF,https://arxiv.org/abs/2311.17119,,2311.17119.pdf,Continuous Pose for Monocular Cameras in Neural Implicit Representation,"In this paper, we showcase the effectiveness of optimizing monocular camera +poses as a continuous function of time. The camera poses are represented using +an implicit neural function which maps the given time to the corresponding +camera pose. The mapped camera poses are then used for the downstream tasks +where joint camera pose optimization is also required. While doing so, the +network parameters -- that implicitly represent camera poses -- are optimized. +We exploit the proposed method in four diverse experimental settings, namely, +(1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual +Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all +four settings, the proposed method performs significantly better than the +compared baselines and the state-of-the-art methods. Additionally, using the +assumption of continuous motion, changes in pose may actually live in a +manifold that has lower than 6 degrees of freedom (DOF) is also realized. 
We +call this low DOF motion representation as the \emph{intrinsic motion} and use +the approach in vSLAM settings, showing impressive camera tracking performance.",cs.CV,['cs.CV'] +Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning,Christopher Liao · Theodoros Tsiligkaridis · Brian Kulis, ,https://arxiv.org/abs/2311.13612,,2311.13612.pdf,Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning,"Over the past year, a large body of multimodal research has emerged around +zero-shot evaluation using GPT descriptors. These studies boost the zero-shot +accuracy of pretrained VL models with an ensemble of label-specific text +generated by GPT. A recent study, WaffleCLIP, demonstrated that similar +zero-shot accuracy can be achieved with an ensemble of random descriptors. +However, both zero-shot methods are un-trainable and consequently sub-optimal +when some few-shot out-of-distribution (OOD) training data is available. +Inspired by these prior works, we present two more flexible methods called +descriptor and word soups, which do not require an LLM at test time and can +leverage training data to increase OOD target accuracy. Descriptor soup +greedily selects a small set of textual descriptors using generic few-shot +training data, then calculates robust class embeddings using the selected +descriptors. Word soup greedily assembles a chain of words in a similar manner. +Compared to existing few-shot soft prompt tuning methods, word soup requires +fewer parameters by construction and less GPU memory, since it does not require +backpropagation. Both soups outperform current published few-shot methods, even +when combined with SoTA zero-shot methods, on cross-dataset and domain +generalization benchmarks. Compared with SoTA prompt and descriptor ensembling +methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy +with fewer ensemble members. Please checkout our code: +github.com/Chris210634/word_soups",cs.CV,['cs.CV'] +ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis,Muhammad Hamza Mughal · Rishabh Dabral · Ikhsanul Habibie · Lucia Donatelli · Marc Habermann · Christian Theobalt, ,https://arxiv.org/abs/2403.17936,,2403.17936.pdf,ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis,"Gestures play a key role in human communication. Recent methods for co-speech +gesture generation, while managing to generate beat-aligned motions, struggle +generating gestures that are semantically aligned with the utterance. Compared +to beat gestures that align naturally to the audio signal, semantically +coherent gestures require modeling the complex interactions between the +language and human motion, and can be controlled by focusing on certain words. +Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal +gesture synthesis, which can not only generate gestures based on multi-modal +speech inputs, but can also facilitate controllability in gesture synthesis. +Our method proposes two guidance objectives that allow the users to modulate +the impact of different conditioning modalities (e.g. audio vs text) as well as +to choose certain words to be emphasized during gesturing. Our method is +versatile in that it can be trained either for generating monologue gestures or +even the conversational gestures. 
To further advance the research on +multi-party interactive gestures, the DnD Group Gesture dataset is released, +which contains 6 hours of gesture data showing 5 people interacting with one +another. We compare our method with several recent works and demonstrate +effectiveness of our method on a variety of tasks. We urge the reader to watch +our supplementary video at our website.",cs.CV,['cs.CV']
SEAS: ShapE-Aligned Supervision for Person Re-Identification,Haidong Zhu · Pranav Budhwant · Zhaoheng Zheng · Ram Nevatia, ,https://arxiv.org/abs/2312.05634,,2312.05634.pdf,PGDS: Pose-Guidance Deep Supervision for Mitigating Clothes-Changing in Person Re-Identification,"Person Re-Identification (Re-ID) task seeks to enhance the tracking of +multiple individuals by surveillance cameras. It supports multimodal tasks, +including text-based person retrieval and human matching. One of the most +significant challenges faced in Re-ID is clothes-changing, where the same +person may appear in different outfits. While previous methods have made +notable progress in maintaining clothing data consistency and handling clothing +change data, they still rely excessively on clothing information, which can +limit performance due to the dynamic nature of human appearances. To mitigate +this challenge, we propose the Pose-Guidance Deep Supervision (PGDS), an +effective framework for learning pose guidance within the Re-ID task. It +consists of three modules: a human encoder, a pose encoder, and a Pose-to-Human +Projection module (PHP). Our framework guides the human encoder, i.e., the main +re-identification model, with pose information from the pose encoder through +multiple layers via the knowledge transfer mechanism from the PHP module, +helping the human encoder learn body parts information without increasing +computation resources in the inference stage. Through extensive experiments, +our method surpasses the performance of current state-of-the-art methods, +demonstrating its robustness and effectiveness for real-world applications. Our +code is available at https://github.com/huyquoctrinh/PGDS.",cs.CV,['cs.CV']
Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling,Olaf Dünkel · Tim Salzmann · Florian Pfaff, ,https://arxiv.org/abs/2404.05675,,2404.05675.pdf,Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling,"Normalizing flows have proven their efficacy for density estimation in +Euclidean space, but their application to rotational representations, crucial +in various domains such as robotics or human pose modeling, remains +underexplored. Probabilistic models of the human pose can benefit from +approaches that rigorously consider the rotational nature of human joints. For +this purpose, we introduce HuProSO3, a normalizing flow model that operates on +a high-dimensional product space of SO(3) manifolds, modeling the joint +distribution for human joints with three degrees of freedom. HuProSO3's +advantage over state-of-the-art approaches is demonstrated through its superior +modeling accuracy in three different applications and its capability to +evaluate the exact likelihood. 
This work not only addresses the technical +challenge of learning densities on SO(3) manifolds, but it also has broader +implications for domains where the probabilistic regression of correlated 3D +rotations is of importance.",cs.CV,['cs.CV'] +ES$^3$: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations,Yuanhang Zhang · Shuang Yang · Shiguang Shan · Xilin Chen, ,https://arxiv.org/abs/2312.10305,,2312.10305.pdf,Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction,"Speech signals are inherently complex as they encompass both global acoustic +characteristics and local semantic information. However, in the task of target +speech extraction, certain elements of global and local semantic information in +the reference speech, which are irrelevant to speaker identity, can lead to +speaker confusion within the speech extraction network. To overcome this +challenge, we propose a self-supervised disentangled representation learning +method. Our approach tackles this issue through a two-phase process, utilizing +a reference speech encoding network and a global information disentanglement +network to gradually disentangle the speaker identity information from other +irrelevant factors. We exclusively employ the disentangled speaker identity +information to guide the speech extraction network. Moreover, we introduce the +adaptive modulation Transformer to ensure that the acoustic representation of +the mixed signal remains undisturbed by the speaker embeddings. This component +incorporates speaker embeddings as conditional information, facilitating +natural and efficient guidance for the speech extraction network. Experimental +results substantiate the effectiveness of our meticulously crafted approach, +showcasing a substantial reduction in the likelihood of speaker confusion.",cs.SD,"['cs.SD', 'cs.AI', 'cs.LG', 'eess.AS']" +Video Interpolation with Diffusion Models,Siddhant Jain · Daniel Watson · Aleksander Holynski · Eric Tabellion · Ben Poole · Janne Kontkanen,https://vidim-interpolation.github.io/,https://arxiv.org/abs/2404.01203,,2404.01203.pdf,Video Interpolation with Diffusion Models,"We present VIDIM, a generative model for video interpolation, which creates +short videos given a start and end frame. In order to achieve high fidelity and +generate motions unseen in the input data, VIDIM uses cascaded diffusion models +to first generate the target video at low resolution, and then generate the +high-resolution video conditioned on the low-resolution generated video. We +compare VIDIM to previous state-of-the-art methods on video interpolation, and +demonstrate how such works fail in most settings where the underlying motion is +complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We +additionally demonstrate how classifier-free guidance on the start and end +frame and conditioning the super-resolution model on the original +high-resolution frames without additional parameters unlocks high-fidelity +results. 
VIDIM is fast to sample from as it jointly denoises all the frames to +be generated, requires less than a billion parameters per diffusion model to +produce compelling results, and still enjoys scalability and improved quality +at larger parameter counts.",cs.CV,['cs.CV'] +Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Semantic Segmentation,Jiafan Zhuang · Zilei Wang · Yixin Zhang · Zhun Fan, ,,https://www.youtube.com/watch?v=k50sUgxC09o,,,,,nan +IReNe: Instant Recoloring of Neural Radiance Fields,Alessio Mazzucchelli · Adrian Garcia-Garcia · Elena Garces · Fernando Rivas-Manzaneque · Francesc Moreno-Noguer · Adrian Penate-Sanchez,https://iviazz97.github.io/irene/,https://arxiv.org/abs/2405.19876,,2405.19876.pdf,IReNe: Instant Recoloring in Neural Radiance Fields,"Advances in NERFs have allowed for 3D scene reconstructions and novel view +synthesis. Yet, efficiently editing these representations while retaining +photorealism is an emerging challenge. Recent methods face three primary +limitations: they're slow for interactive use, lack precision at object +boundaries, and struggle to ensure multi-view consistency. We introduce IReNe +to address these limitations, enabling swift, near real-time color editing in +NeRF. Leveraging a pre-trained NeRF model and a single training image with +user-applied color edits, IReNe swiftly adjusts network parameters in seconds. +This adjustment allows the model to generate new scene views, accurately +representing the color changes from the training image while also controlling +object boundaries and view-specific effects. Object boundary control is +achieved by integrating a trainable segmentation module into the model. The +process gains efficiency by retraining only the weights of the last network +layer. We observed that neurons in this layer can be classified into those +responsible for view-dependent appearance and those contributing to diffuse +appearance. We introduce an automated classification approach to identify these +neuron types and exclusively fine-tune the weights of the diffuse neurons. This +further accelerates training and ensures consistent color edits across +different views. A thorough validation on a new dataset, with edited object +colors, shows significant quantitative and qualitative advancements over +competitors, accelerating speeds by 5x to 500x.",cs.CV,['cs.CV'] +FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization,Shuai Tan · Bin Ji · Ye Pan, ,https://arxiv.org/abs/2403.06375,,2403.06375.pdf,FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization,"Generating emotional talking faces is a practical yet challenging endeavor. +To create a lifelike avatar, we draw upon two critical insights from a human +perspective: 1) The connection between audio and the non-deterministic facial +dynamics, encompassing expressions, blinks, poses, should exhibit synchronous +and one-to-many mapping. 2) Vibrant expressions are often accompanied by +emotion-aware high-definition (HD) textures and finely detailed teeth. However, +both aspects are frequently overlooked by existing methods. To this end, this +paper proposes using normalizing Flow and Vector-Quantization modeling to +produce emotional talking faces that satisfy both insights concurrently +(FlowVQTalker). 
Specifically, we develop a flow-based coefficient generator +that encodes the dynamics of facial emotion into a multi-emotion-class latent +space represented as a mixture distribution. The generation process commences +with random sampling from the modeled distribution, guided by the accompanying +audio, enabling both lip-synchronization and the uncertain nonverbal facial +cues generation. Furthermore, our designed vector-quantization image generator +treats the creation of expressive facial images as a code query task, utilizing +a learned codebook to provide rich, high-quality textures that enhance the +emotional perception of the results. Extensive experiments are conducted to +showcase the effectiveness of our approach.",cs.CV,['cs.CV'] +Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations,Kewei Wang · Yizheng Wu · Jun Cen · Zhiyu Pan · Xingyi Li · Zhe Wang · Zhiguo Cao · Guosheng Lin, ,https://arxiv.org/abs/2403.13261,,2403.13261.pdf,Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations,"The perception of motion behavior in a dynamic environment holds significant +importance for autonomous driving systems, wherein class-agnostic motion +prediction methods directly predict the motion of the entire point cloud. While +most existing methods rely on fully-supervised learning, the manual labeling of +point cloud data is laborious and time-consuming. Therefore, several +annotation-efficient methods have been proposed to address this challenge. +Although effective, these methods rely on weak annotations or additional +multi-modal data like images, and the potential benefits inherent in the point +cloud sequence are still underexplored. To this end, we explore the feasibility +of self-supervised motion prediction with only unlabeled LiDAR point clouds. +Initially, we employ an optimal transport solver to establish coarse +correspondences between current and future point clouds as the coarse pseudo +motion labels. Training models directly using such coarse labels leads to +noticeable spatial and temporal prediction inconsistencies. To mitigate these +issues, we introduce three simple spatial and temporal regularization losses, +which facilitate the self-supervised training process effectively. Experimental +results demonstrate the significant superiority of our approach over the +state-of-the-art self-supervised methods.",cs.CV,['cs.CV'] +Latency Correction for Event-guided Deblurring and Frame Interpolation,Yixin Yang · Jinxiu Liang · Bohan Yu · Yan Chen · Jimmy S. Ren · Boxin Shi, ,https://arxiv.org/abs/2306.15507,,2306.15507.pdf,Self-supervised Learning of Event-guided Video Frame Interpolation for Rolling Shutter Frames,"This paper makes the first attempt to tackle the challenging task of +recovering arbitrary frame rate latent global shutter (GS) frames from two +consecutive rolling shutter (RS) frames, guided by the novel event camera data. +Although events possess high temporal resolution, beneficial for video frame +interpolation (VFI), a hurdle in tackling this task is the lack of paired GS +frames. Another challenge is that RS frames are susceptible to distortion when +capturing moving objects. To this end, we propose a novel self-supervised +framework that leverages events to guide RS frame correction and VFI in a +unified framework. 
Our key idea is to estimate the displacement field (DF) +non-linear dense 3D spatiotemporal information of all pixels during the +exposure time, allowing for the reciprocal reconstruction between RS and GS +frames as well as arbitrary frame rate VFI. Specifically, the displacement +field estimation (DFE) module is proposed to estimate the spatiotemporal motion +from events to correct the RS distortion and interpolate the GS frames in one +step. We then combine the input RS frames and DF to learn a mapping for +RS-to-GS frame interpolation. However, as the mapping is highly +under-constrained, we couple it with an inverse mapping (i.e., GS-to-RS) and RS +frame warping (i.e., RS-to-RS) for self-supervision. As there is a lack of +labeled datasets for evaluation, we generate two synthetic datasets and collect +a real-world dataset to train and test our method. Experimental results show +that our method yields comparable or better performance with prior supervised +methods.",cs.CV,"['cs.CV', 'cs.RO']" +Single Domain Generalization for Crowd Counting,Zhuoxuan Peng · S.-H. Gary Chan,https://github.com/Shimmer93/MPCount,https://arxiv.org/abs/2403.09124,,2403.09124.pdf,Single Domain Generalization for Crowd Counting,"Due to its promising results, density map regression has been widely employed +for image-based crowd counting. The approach, however, often suffers from +severe performance degradation when tested on data from unseen scenarios, the +so-called ""domain shift"" problem. To address the problem, we investigate in +this work single domain generalization (SDG) for crowd counting. The existing +SDG approaches are mainly for image classification and segmentation, and can +hardly be extended to our case due to its regression nature and label ambiguity +(i.e., ambiguous pixel-level ground truths). We propose MPCount, a novel +effective SDG approach even for narrow source distribution. MPCount stores +diverse density values for density map regression and reconstructs +domain-invariant features by means of only one memory bank, a content error +mask and attention consistency loss. By partitioning the image into grids, it +employs patch-wise classification as an auxiliary task to mitigate label +ambiguity. Through extensive experiments on different datasets, MPCount is +shown to significantly improve counting accuracy compared to the state of the +art under diverse scenarios unobserved in the training data characterized by +narrow source distribution. Code is available at +https://github.com/Shimmer93/MPCount.",cs.CV,['cs.CV'] +Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild,Fanghua Yu · Jinjin Gu · Zheyuan Li · Jinfan Hu · Xiangtao Kong · Xintao Wang · Jingwen He · Yu Qiao · Chao Dong, ,https://arxiv.org/abs/2401.13627,,2401.13627.pdf,Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild,"We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image +restoration method that harnesses generative prior and the power of model +scaling up. Leveraging multi-modal techniques and advanced generative prior, +SUPIR marks a significant advance in intelligent and realistic image +restoration. As a pivotal catalyst within SUPIR, model scaling dramatically +enhances its capabilities and demonstrates new potential for image restoration. +We collect a dataset comprising 20 million high-resolution, high-quality images +for model training, each enriched with descriptive text annotations. 
SUPIR +provides the capability to restore images guided by textual prompts, broadening +its application scope and potential. Moreover, we introduce negative-quality +prompts to further improve perceptual quality. We also develop a +restoration-guided sampling method to suppress the fidelity issue encountered +in generative-based restoration. Experiments demonstrate SUPIR's exceptional +restoration effects and its novel capacity to manipulate restoration through +textual prompts.",cs.CV,['cs.CV'] +ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles,Jiawei Zhang · Chejian Xu · Bo Li, ,https://arxiv.org/abs/2405.14062,,2405.14062.pdf,ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles,"We present ChatScene, a Large Language Model (LLM)-based agent that leverages +the capabilities of LLMs to generate safety-critical scenarios for autonomous +vehicles. Given unstructured language instructions, the agent first generates +textually described traffic scenarios using LLMs. These scenario descriptions +are subsequently broken down into several sub-descriptions for specified +details such as behaviors and locations of vehicles. The agent then +distinctively transforms the textually described sub-scenarios into +domain-specific languages, which then generate actual code for prediction and +control in simulators, facilitating the creation of diverse and complex +scenarios within the CARLA simulation environment. A key part of our agent is a +comprehensive knowledge retrieval component, which efficiently translates +specific textual descriptions into corresponding domain-specific code snippets +by training a knowledge database containing the scenario description and code +pairs. Extensive experimental results underscore the efficacy of ChatScene in +improving the safety of autonomous vehicles. For instance, the scenarios +generated by ChatScene show a 15% increase in collision rates compared to +state-of-the-art baselines when tested against different reinforcement +learning-based ego vehicles. Furthermore, we show that by using our generated +safety-critical scenarios to fine-tune different RL-based autonomous driving +models, they can achieve a 9% reduction in collision rates, surpassing current +SOTA methods. ChatScene effectively bridges the gap between textual +descriptions of traffic scenarios and practical CARLA simulations, providing a +unified way to conveniently generate safety-critical scenarios for safety +testing and improvement for AVs.",cs.AI,"['cs.AI', 'cs.LG']" +KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation,Jihua Peng · Yanghong Zhou · Tracy P Y Mok, ,https://arxiv.org/abs/2404.00658,,2404.00658.pdf,KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation,"This paper presents a novel Kinematics and Trajectory Prior +Knowledge-Enhanced Transformer (KTPFormer), which overcomes the weakness in +existing transformer-based methods for 3D human pose estimation that the +derivation of Q, K, V vectors in their self-attention mechanisms are all based +on simple linear mapping. We propose two prior attention modules, namely +Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA) to take +advantage of the known anatomical structure of the human body and motion +trajectory information, to facilitate effective learning of global dependencies +and features in the multi-head self-attention. 
KPA models kinematic +relationships in the human body by constructing a topology of kinematics, while +TPA builds a trajectory topology to learn the information of joint motion +trajectory across frames. Yielding Q, K, V vectors with prior knowledge, the +two modules enable KTPFormer to model both spatial and temporal correlations +simultaneously. Extensive experiments on three benchmarks (Human3.6M, +MPI-INF-3DHP and HumanEva) show that KTPFormer achieves superior performance in +comparison to state-of-the-art methods. More importantly, our KPA and TPA +modules have lightweight plug-and-play designs and can be integrated into +various transformer-based networks (i.e., diffusion-based) to improve the +performance with only a very small increase in the computational overhead. The +code is available at: https://github.com/JihuaPeng/KTPFormer.",cs.CV,['cs.CV'] +Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification,Pingping Zhang · Yuhao Wang · Yang Liu · Zhengzheng Tu · Huchuan Lu,https://github.com/924973292/EDITOR,https://arxiv.org/abs/2403.10254,,2403.10254.pdf,Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification,"Single-modal object re-identification (ReID) faces great challenges in +maintaining robustness within complex visual scenarios. In contrast, +multi-modal object ReID utilizes complementary information from diverse +modalities, showing great potentials for practical applications. However, +previous methods may be easily affected by irrelevant backgrounds and usually +ignore the modality gaps. To address above issues, we propose a novel learning +framework named \textbf{EDITOR} to select diverse tokens from vision +Transformers for multi-modal object ReID. We begin with a shared vision +Transformer to extract tokenized features from different input modalities. +Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to +adaptively select object-centric tokens with both spatial and frequency +information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) +module to facilitate feature interactions within and across modalities. +Finally, to further reduce the effect of backgrounds, we propose a Background +Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). +They are formulated as two new loss functions, which improve the feature +discrimination with background suppression. As a result, our framework can +generate more discriminative features for multi-modal object ReID. Extensive +experiments on three multi-modal ReID benchmarks verify the effectiveness of +our methods. The code is available at https://github.com/924973292/EDITOR.",cs.CV,"['cs.CV', 'cs.IR', 'cs.MM']" +ShapeWalk: Compositional Shape Editing through Language-Guided Chains,Habib Slim · Mohamed Elhoseiny,https://shapewalk.github.io/,https://arxiv.org/html/2405.20319v1,,2405.20319v1.pdf,ParSEL: Parameterized Shape Editing with Language,"The ability to edit 3D assets from natural language presents a compelling +paradigm to aid in the democratization of 3D content creation. However, while +natural language is often effective at communicating general intent, it is +poorly suited for specifying precise manipulation. To address this gap, we +introduce ParSEL, a system that enables controllable editing of high-quality 3D +assets from natural language. Given a segmented 3D mesh and an editing request, +ParSEL produces a parameterized editing program. 
Adjusting the program +parameters allows users to explore shape variations with a precise control over +the magnitudes of edits. To infer editing programs which align with an input +edit request, we leverage the abilities of large-language models (LLMs). +However, while we find that LLMs excel at identifying initial edit operations, +they often fail to infer complete editing programs, and produce outputs that +violate shape semantics. To overcome this issue, we introduce Analytical Edit +Propagation (AEP), an algorithm which extends a seed edit with additional +operations until a complete editing program has been formed. Unlike prior +methods, AEP searches for analytical editing operations compatible with a range +of possible user edits through the integration of computer algebra systems for +geometric analysis. Experimentally we demonstrate ParSEL's effectiveness in +enabling controllable editing of 3D objects through natural language requests +over alternative system designs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.HC', 'cs.SC']" +Video-P2P: Video Editing with Cross-attention Control,Shaoteng Liu · Yuechen Zhang · Wenbo Li · Zhe Lin · Jiaya Jia, ,,https://www.researchgate.net/publication/380733385_Video-P2P_Video_Editing_with_Cross-attention_Control,,,,,nan +R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization,Kennard Chan · Fayao Liu · Guosheng Lin · Chuan-Sheng Foo · Weisi Lin, ,https://arxiv.org/html/2401.12175v2,,2401.12175v2.pdf,Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM,"Reconstructing 3D humans from a single image has been extensively +investigated. However, existing approaches often fall short on capturing fine +geometry and appearance details, hallucinating occluded parts with plausible +details, and achieving generalization across unseen and in-the-wild datasets. +We present Human-LRM, a diffusion-guided feed-forward model that predicts the +implicit field of a human from a single image. Leveraging the power of the +state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e +Stable Diffusion), our method is able to capture human without any template +prior, e.g., SMPL, and effectively enhance occluded parts with rich and +realistic details. Our approach first uses a single-view LRM model with an +enhanced geometry decoder to get the triplane NeRF representation. The novel +view renderings from the triplane NeRF provide strong geometry and color prior, +from which we generate photo-realistic details for the occluded parts using a +diffusion model. The generated multiple views then enable reconstruction with +high-quality geometry and appearance, leading to superior overall performance +comparing to all existing human reconstruction methods.",cs.CV,['cs.CV'] +Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model,Wenfeng Song · Xingliang Jin · Shuai Li · Chenglizhao Chen · Aimin Hao · Xia HOU · Ning Li · Hong Qin,https://xingliangjin.github.io/MCM-LDM-Web/,https://arxiv.org/abs/2306.09330,,2306.09330.pdf,ArtFusion: Controllable Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models,"Arbitrary Style Transfer (AST) aims to transform images by adopting the style +from any selected artwork. Nonetheless, the need to accommodate diverse and +subjective user preferences poses a significant challenge. While some users +wish to preserve distinct content structures, others might favor a more +pronounced stylization. 
Despite advances in feed-forward AST methods, their +limited customizability hinders their practical application. We propose a new +approach, ArtFusion, which provides a flexible balance between content and +style. In contrast to traditional methods reliant on biased similarity losses, +ArtFusion utilizes our innovative Dual Conditional Latent Diffusion +Probabilistic Models (Dual-cLDM). This approach mitigates repetitive patterns +and enhances subtle artistic aspects like brush strokes and genre-specific +features. Despite the promising results of conditional diffusion probabilistic +models (cDM) in various generative tasks, their introduction to style transfer +is challenging due to the requirement for paired training data. ArtFusion +successfully navigates this issue, offering more practical and controllable +stylization. A key element of our approach involves using a single image for +both content and style during model training, all the while maintaining +effective stylization during inference. ArtFusion outperforms existing +approaches on outstanding controllability and faithful presentation of artistic +details, providing evidence of its superior style transfer capabilities. +Furthermore, the Dual-cLDM utilized in ArtFusion carries the potential for a +variety of complex multi-condition generative tasks, thus greatly broadening +the impact of our research.",cs.CV,['cs.CV'] +HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting,Hongyu Zhou · Jiahao Shao · Lu Xu · Dongfeng Bai · Weichao Qiu · Bingbing Liu · Yue Wang · Andreas Geiger · Yiyi Liao, ,https://arxiv.org/abs/2403.12722,,2403.12722.pdf,HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting,"Holistic understanding of urban scenes based on RGB images is a challenging +yet important problem. It encompasses understanding both the geometry and +appearance to enable novel view synthesis, parsing semantic labels, and +tracking moving objects. Despite considerable progress, existing approaches +often focus on specific aspects of this task and require additional inputs such +as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we +introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic +urban scene understanding. Our main idea involves the joint optimization of +geometry, appearance, semantics, and motion using a combination of static and +dynamic 3D Gaussians, where moving object poses are regularized via physical +constraints. Our approach offers the ability to render new viewpoints in +real-time, yielding 2D and 3D semantic information with high accuracy, and +reconstruct dynamic scenes, even in scenarios where 3D bounding box detection +are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 +demonstrate the effectiveness of our approach.",cs.CV,['cs.CV'] +HumMUSS: Human Motion Understanding using State Space Models,Arnab Mondal · Stefano Alletto · Denis Tome, ,https://arxiv.org/abs/2404.10880,,2404.10880.pdf,HumMUSS: Human Motion Understanding using State Space Models,"Understanding human motion from video is essential for a range of +applications, including pose estimation, mesh recovery and action recognition. +While state-of-the-art methods predominantly rely on transformer-based +architectures, these approaches have limitations in practical scenarios. +Transformers are slower when sequentially predicting on a continuous stream of +frames in real-time, and do not generalize to new frame rates. 
In light of +these constraints, we propose a novel attention-free spatiotemporal model for +human motion understanding building upon recent advancements in state space +models. Our model not only matches the performance of transformer-based models +in various motion understanding tasks but also brings added benefits like +adaptability to different video frame rates and enhanced training speed when +working with longer sequence of keypoints. Moreover, the proposed model +supports both offline and real-time applications. For real-time sequential +prediction, our model is both memory efficient and several times faster than +transformer-based approaches while maintaining their high accuracy.",cs.CV,"['cs.CV', 'cs.AI']" +SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,Jaskirat Singh · Jianming Zhang · Qing Liu · Cameron Smith · Zhe Lin · Liang Zheng, ,https://arxiv.org/abs/2312.05039,,,SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,"The field of generative image inpainting and object insertion has made +significant progress with the recent advent of latent diffusion models. +Utilizing a precise object mask can greatly enhance these applications. +However, due to the challenges users encounter in creating high-fidelity masks, +there is a tendency for these methods to rely on more coarse masks (e.g., +bounding box) for these applications. This results in limited control and +compromised background content preservation. To overcome these limitations, we +introduce SmartMask, which allows any novice user to create detailed masks for +precise object insertion. Combined with a ControlNet-Inpaint model, our +experiments demonstrate that SmartMask achieves superior object insertion +quality, preserving the background content more effectively than previous +methods. Notably, unlike prior works the proposed approach can also be used +even without user-mask guidance, which allows it to perform mask-free object +insertion at diverse positions and scales. Furthermore, we find that when used +iteratively with a novel instruction-tuning based planning model, SmartMask can +be used to design detailed layouts from scratch. As compared with user-scribble +based layout design, we observe that SmartMask allows for better quality +outputs with layout-to-image generation methods. Project page is available at +https://smartmask-gen.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG', 'cs.MM']" +Towards Progressive Multi-Frequency Representation for Image Warping,Jun Xiao · Zihang Lyu · Cong Zhang · Yakun Ju · Changjian Shui · Kin-man Lam, ,https://arxiv.org/abs/2404.10716,,2404.10716.pdf,MOWA: Multiple-in-One Image Warping Model,"While recent image warping approaches achieved remarkable success on existing +benchmarks, they still require training separate models for each specific task +and cannot generalize well to different camera models or customized +manipulations. To address diverse types of warping in practice, we propose a +Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we +mitigate the difficulty of multi-task learning by disentangling the motion +estimation at both the region level and pixel level. To further enable dynamic +task-aware image warping, we introduce a lightweight point-based classifier +that predicts the task type, serving as prompts to modulate the feature maps +for better estimation. 
To our knowledge, this is the first work that solves +multiple practical warping tasks in one single model. Extensive experiments +demonstrate that our MOWA, which is trained on six tasks for multiple-in-one +image warping, outperforms state-of-the-art task-specific models across most +tasks. Moreover, MOWA also exhibits promising potential to generalize into +unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code +will be made publicly available.",cs.CV,['cs.CV'] +V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,Penghao Wu · Saining Xie,https://vstar-seal.github.io/,https://arxiv.org/abs/2312.14135,,2312.14135.pdf,V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,"When we look around and perform complex tasks, how we see and selectively +process what we see is crucial. However, the lack of this visual search +mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on +important visual details, especially when handling high-resolution and visually +crowded images. To address this, we introduce V*, an LLM-guided visual search +mechanism that employs the world knowledge in LLMs for efficient visual +querying. When combined with an MLLM, this mechanism enhances collaborative +reasoning, contextual understanding, and precise targeting of specific visual +elements. This integration results in a new MLLM meta-architecture, named Show, +sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically +designed to evaluate MLLMs in their ability to process high-resolution images +and focus on visual details. Our study highlights the necessity of +incorporating visual search capabilities into multimodal systems. The code is +available https://github.com/penghao-wu/vstar.",cs.CV,['cs.CV'] +Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models,Xinpeng Ding · Jianhua Han · Hang Xu · Xiaodan Liang · Wei Zhang · Xiaomeng Li, ,https://arxiv.org/abs/2401.00988v1,,2401.00988v1.pdf,Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models,"The rise of multimodal large language models (MLLMs) has spurred interest in +language-based driving tasks. However, existing research typically focuses on +limited tasks and often omits key multi-view and temporal information which is +crucial for robust autonomous driving. To bridge these gaps, we introduce +NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 +subtasks, where each task demands holistic information (e.g., temporal, +multi-view, and spatial), significantly elevating the challenge level. To +obtain NuInstruct, we propose a novel SQL-based method to generate +instruction-response pairs automatically, which is inspired by the driving +logical progression of humans. We further present BEV-InMLLM, an end-to-end +method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) +features, language-aligned for large language models. BEV-InMLLM integrates +multi-view, spatial awareness, and temporal semantics to enhance MLLMs' +capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module +is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct +demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g. +around 9% improvement on various tasks. 
We plan to release our NuInstruct for +future research development.",cs.CV,['cs.CV'] +Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation,Xinyao Li · Yuke Li · Zhekai Du · Fengling Li · Ke Lu · Jingjing Li,https://github.com/TL-UESTC/UniMoS,https://arxiv.org/abs/2403.06946,,2403.06946.pdf,Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation,"Large vision-language models (VLMs) like CLIP have demonstrated good +zero-shot learning performance in the unsupervised domain adaptation task. Yet, +most transfer approaches for VLMs focus on either the language or visual +branches, overlooking the nuanced interplay between both modalities. In this +work, we introduce a Unified Modality Separation (UniMoS) framework for +unsupervised domain adaptation. Leveraging insights from modality gap studies, +we craft a nimble modality separation network that distinctly disentangles +CLIP's features into language-associated and vision-associated components. Our +proposed Modality-Ensemble Training (MET) method fosters the exchange of +modality-agnostic information while maintaining modality-specific nuances. We +align features across domains using a modality discriminator. Comprehensive +evaluations on three benchmarks reveal our approach sets a new state-of-the-art +with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS",cs.CV,['cs.CV'] +Relation Rectification in Diffusion Model,Yinwei Wu · Xingyi Yang · Xinchao Wang,https://wuyinwei-hah.github.io/rrnet.github.io/,https://arxiv.org/abs/2403.20249,,2403.20249.pdf,Relation Rectification in Diffusion Model,"Despite their exceptional generative abilities, large text-to-image diffusion +models, much like skilled but careless artists, often struggle with accurately +depicting visual relationships between objects. This issue, as we uncover +through careful analysis, arises from a misaligned text encoder that struggles +to interpret specific relationships and differentiate the logical order of +associated objects. To resolve this, we introduce a novel task termed Relation +Rectification, aiming to refine the model to accurately represent a given +relationship it initially fails to generate. To address this, we propose an +innovative solution utilizing a Heterogeneous Graph Convolutional Network +(HGCN). It models the directional relationships between relation terms and +corresponding objects within the input prompts. Specifically, we optimize the +HGCN on a pair of prompts with identical relational words but reversed object +orders, supplemented by a few reference images. The lightweight HGCN adjusts +the text embeddings generated by the text encoder, ensuring the accurate +reflection of the textual relation in the embedding space. Crucially, our +method retains the parameters of the text encoder and diffusion model, +preserving the model's robust performance on unrelated descriptions. We +validated our approach on a newly curated dataset of diverse relational data, +demonstrating both quantitative and qualitative enhancements in generating +images with precise visual relations. 
Project page: +https://wuyinwei-hah.github.io/rrnet.github.io/.",cs.CV,['cs.CV'] +CoralSCOP: Segment any COral Image on this Planet,"Zheng Ziqiang · Liang Haixin · Binh-Son Hua · Tim, Yue Him Wong · Put ANG · Apple CHUI · Sai-Kit Yeung", ,,https://ais.hkust.edu.hk/whats-happening/news/isd-research-team-produces-first-model-segment-and-generalize-coral-reef-image,,,,,nan +Category-Level Multi-Part Multi-Joint 3D Shape Assembly,Yichen Li · Kaichun Mo · Yueqi Duan · He Wang · Jiequan Zhang · Lin Shao · Wojciech Matusik · Leonidas Guibas, ,,,,,,,nan +AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings,Jamie Watson · Filippo Aleotti · Mohamed Sayed · Zawar Qureshi · Oisin Mac Aodha · Gabriel J. Brostow · Michael Firman · Sara Vicente,https://nianticlabs.github.io/airplanes/,,https://link.springer.com/article/10.1007/s00371-023-03110-7,,,,,nan +Fun with Flags: Robust Principal Directions via Flag Manifolds,Tolga Birdal · Nathan Mankovich, ,https://arxiv.org/abs/2401.04071v1,,2401.04071v1.pdf,Fun with Flags: Robust Principal Directions via Flag Manifolds,"Principal component analysis (PCA), along with its extensions to manifolds +and outlier contaminated data, have been indispensable in computer vision and +machine learning. In this work, we present a unifying formalism for PCA and its +variants, and introduce a framework based on the flags of linear subspaces, \ie +a hierarchy of nested linear subspaces of increasing dimension, which not only +allows for a common implementation but also yields novel variants, not explored +previously. We begin by generalizing traditional PCA methods that either +maximize variance or minimize reconstruction error. We expand these +interpretations to develop a wide array of new dimensionality reduction +algorithms by accounting for outliers and the data manifold. To devise a common +computational approach, we recast robust and dual forms of PCA as optimization +problems on flag manifolds. We then integrate tangent space approximations of +principal geodesic analysis (tangent-PCA) into this flag-based framework, +creating novel robust and dual geodesic PCA variations. The remarkable +flexibility offered by the 'flagification' introduced here enables even more +algorithmic variants identified by specific flag types. Last but not least, we +propose an effective convergent solver for these flag-formulations employing +the Stiefel manifold. Our empirical results on both real-world and synthetic +scenarios, demonstrate the superiority of our novel algorithms, especially in +terms of robustness to outliers on manifolds.",cs.CV,"['cs.CV', 'cs.LG', 'math.DG', 'math.OC', 'stat.ML']" +PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors,Haley So · Laurie Bose · Piotr Dudek · Gordon Wetzstein, ,,,,,,,nan +Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations,Chenyu You · Yifei Min · Weicheng Dai · Jasjeet Sekhon · Lawrence Staib · James Duncan, ,https://arxiv.org/abs/2403.07241,,2403.07241.pdf,Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations,"Fine-tuning pre-trained vision-language models, like CLIP, has yielded +success on diverse downstream tasks. However, several pain points persist for +this paradigm: (i) directly tuning entire pre-trained models becomes both +time-intensive and computationally costly. 
Additionally, these tuned models +tend to become highly specialized, limiting their practicality for real-world +deployment; (ii) recent studies indicate that pre-trained vision-language +classifiers may overly depend on spurious features -- patterns that correlate +with the target in training data, but are not related to the true labeling +function; and (iii) existing studies on mitigating the reliance on spurious +features, largely based on the assumption that we can identify such features, +does not provide definitive assurance for real-world applications. As a +piloting study, this work focuses on exploring mitigating the reliance on +spurious features for CLIP without using any group annotation. To this end, we +systematically study the existence of spurious correlation on CLIP and +CILP+ERM. We first, following recent work on Deep Feature Reweighting (DFR), +verify that last-layer retraining can greatly improve group robustness on +pretrained CLIP. In view of them, we advocate a lightweight representation +calibration method for fine-tuning CLIP, by first generating a calibration set +using the pretrained CLIP, and then calibrating representations of samples +within this set through contrastive learning, all without the need for group +labels. Extensive experiments and in-depth visualizations on several benchmarks +validate the effectiveness of our proposals, largely reducing reliance and +significantly boosting the model generalization.",cs.CV,"['cs.CV', 'cs.LG']" +Guided Slot Attention for Unsupervised Video Object Segmentation,Minhyeok Lee · Suhwan Cho · Dogyoon Lee · Chaewon Park · Jungho Lee · Sangyoun Lee, ,https://arxiv.org/abs/2309.14786,,,Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation,"Unsupervised video object segmentation (VOS) is a task that aims to detect +the most salient object in a video without external guidance about the object. +To leverage the property that salient objects usually have distinctive +movements compared to the background, recent methods collaboratively use motion +cues extracted from optical flow maps with appearance cues extracted from RGB +images. However, as optical flow maps are usually very relevant to segmentation +masks, the network is easy to be learned overly dependent on the motion cues +during network training. As a result, such two-stream approaches are vulnerable +to confusing motion cues, making their prediction unstable. To relieve this +issue, we design a novel motion-as-option network by treating motion cues as +optional. During network training, RGB images are randomly provided to the +motion encoder instead of optical flow maps, to implicitly reduce motion +dependency of the network. As the learned motion encoder can deal with both RGB +images and optical flow maps, two different predictions can be generated +depending on which source information is used as motion input. In order to +fully exploit this property, we also propose an adaptive output selection +algorithm to adopt optimal prediction result at test time. 
Our proposed +approach affords state-of-the-art performance on all public benchmark datasets, +even maintaining real-time inference speed.",cs.CV,['cs.CV'] +A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning,Siddharth Srivastava · Gaurav Sharma, ,https://arxiv.org/abs/2310.09276,,2310.09276.pdf,Transformer-based Multimodal Change Detection with Multitask Consistency Constraints,"Change detection plays a fundamental role in Earth observation for analyzing +temporal iterations over time. However, recent studies have largely neglected +the utilization of multimodal data that presents significant practical and +technical advantages compared to single-modal approaches. This research focuses +on leveraging {pre-event} digital surface model (DSM) data and {post-event} +digital aerial images captured at different times for detecting change beyond +2D. We observe that the current change detection methods struggle with the +multitask conflicts between semantic and height change detection tasks. To +address this challenge, we propose an efficient Transformer-based network that +learns shared representation between cross-dimensional inputs through +cross-attention. {It adopts a consistency constraint to establish the +multimodal relationship. Initially, pseudo-changes are derived by employing +height change thresholding. Subsequently, the $L2$ distance between semantic +and pseudo-changes within their overlapping regions is minimized. This +explicitly endows the height change detection (regression task) and semantic +change detection (classification task) with representation consistency.} A +DSM-to-image multimodal dataset encompassing three cities in the Netherlands +was constructed. It lays a new foundation for beyond-2D change detection from +cross-dimensional inputs. Compared to five state-of-the-art change detection +methods, our model demonstrates consistent multitask superiority in terms of +semantic and height change detection. Furthermore, the consistency strategy can +be seamlessly adapted to the other methods, yielding promising improvements.",cs.CV,['cs.CV'] +Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles,Rui Song · Chenwei Liang · Hu Cao · Zhiran Yan · Walter Zimmer · Markus Gross · Andreas Festag · Alois Knoll,https://rruisong.github.io/publications/CoHFF/,https://arxiv.org/abs/2402.07635,,2402.07635.pdf,Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles,"Collaborative perception in automated vehicles leverages the exchange of +information between agents, aiming to elevate perception results. Previous +camera-based collaborative 3D perception methods typically employ 3D bounding +boxes or bird's eye views as representations of the environment. However, these +approaches fall short in offering a comprehensive 3D environmental prediction. +To bridge this gap, we introduce the first method for collaborative 3D semantic +occupancy prediction. Particularly, it improves local 3D semantic occupancy +predictions by hybrid fusion of (i) semantic and occupancy task features, and +(ii) compressed orthogonal attention features shared between vehicles. +Additionally, due to the lack of a collaborative perception dataset designed +for semantic occupancy prediction, we augment a current collaborative +perception dataset to include 3D collaborative semantic occupancy labels for a +more robust evaluation. 
The experimental findings highlight that: (i) our +collaborative semantic occupancy predictions excel above the results from +single vehicles by over 30%, and (ii) models anchored on semantic occupancy +outpace state-of-the-art collaborative 3D detection techniques in subsequent +perception applications, showcasing enhanced accuracy and enriched +semantic-awareness in road environments.",cs.CV,['cs.CV'] +"EVS-assisted joint Deblurring, Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling",Rui Jiang · Fangwen Tu · Yixuan Long · Aabhaas Vaish · Bowen Zhou · Qinyi Wang · Wei Zhang · Yuntan Fang · Luis Eduardo García Capel · Bo Mu · Tiejun Dai · Andreas Suess, ,https://arxiv.org/abs/2404.18156,,,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion,Zehuan Huang · Hao Wen · Junting Dong · Yaohui Wang · Yangguang Li · Xinyuan Chen · Yan-Pei Cao · Ding Liang · Yu Qiao · Bo Dai · Lu Sheng,https://huanngzh.github.io/EpiDiff/,https://arxiv.org/abs/2312.06725,,2312.06725.pdf,EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion,"Generating multiview images from a single view facilitates the rapid +generation of a 3D mesh conditioned on a single image. Recent methods that +introduce 3D global representation into diffusion models have shown the +potential to generate consistent multiviews, but they have reduced generation +speed and face challenges in maintaining generalizability and quality. To +address this issue, we propose EpiDiff, a localized interactive multiview +diffusion model. At the core of the proposed approach is to insert a +lightweight epipolar attention block into the frozen diffusion model, +leveraging epipolar constraints to enable cross-view interaction among feature +maps of neighboring views. 
The newly initialized 3D modeling module preserves +the original feature distribution of the diffusion model, exhibiting +compatibility with a variety of base diffusion models. Experiments show that +EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses +previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS. +Additionally, EpiDiff can generate a more diverse distribution of views, +improving the reconstruction quality from generated multiviews. Please see our +project page at https://huanngzh.github.io/EpiDiff/.",cs.CV,['cs.CV'] +GaussianAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh,Jing Wen · Xiaoming Zhao · Jason Ren · Alexander G. Schwing · Shenlong Wang, ,https://arxiv.org/abs/2404.07991,,2404.07991.pdf,GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh,"We introduce GoMAvatar, a novel approach for real-time, memory-efficient, +high-quality animatable human modeling. GoMAvatar takes as input a single +monocular video to create a digital avatar capable of re-articulation in new +poses and real-time rendering from novel viewpoints, while seamlessly +integrating with rasterization-based graphics pipelines. Central to our method +is the Gaussians-on-Mesh representation, a hybrid 3D model combining rendering +quality and speed of Gaussian splatting with geometry modeling and +compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and +various YouTube videos. GoMAvatar matches or surpasses current monocular human +modeling algorithms in rendering quality and significantly outperforms them in +computational efficiency (43 FPS) while being memory-efficient (3.63 MB per +subject).",cs.CV,['cs.CV'] +Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes,Zhiyuan Yu · Zheng Qin · lintao zheng · Kai Xu, ,https://arxiv.org/abs/2404.04557,,2404.04557.pdf,Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes,"Multi-instance point cloud registration estimates the poses of multiple +instances of a model point cloud in a scene point cloud. Extracting accurate +point correspondence is to the center of the problem. Existing approaches +usually treat the scene point cloud as a whole, overlooking the separation of +instances. Therefore, point features could be easily polluted by other points +from the background or different instances, leading to inaccurate +correspondences oblivious to separate instances, especially in cluttered +scenes. In this work, we propose MIRETR, Multi-Instance REgistration +TRansformer, a coarse-to-fine approach to the extraction of instance-aware +correspondences. At the coarse level, it jointly learns instance-aware +superpoint features and predicts per-instance masks. With instance masks, the +influence from outside of the instance being concerned is minimized, such that +highly reliable superpoint correspondences can be extracted. The superpoint +correspondences are then extended to instance candidates at the fine level +according to the instance masks. At last, an efficient candidate selection and +refinement algorithm is devised to obtain the final registrations. Extensive +experiments on three public benchmarks demonstrate the efficacy of our +approach. In particular, MIRETR outperforms the state of the arts by 16.6 +points on F1 score on the challenging ROBI benchmark. 
Code and models are +available at https://github.com/zhiyuanYU134/MIRETR.",cs.CV,['cs.CV'] +RTracker: Recoverable Tracking via PN Tree Structured Memory,Yuqing Huang · Xin Li · Zikun Zhou · Yaowei Wang · Zhenyu He · Ming-Hsuan Yang, ,https://arxiv.org/abs/2403.19242,,2403.19242.pdf,RTracker: Recoverable Tracking via PN Tree Structured Memory,"Existing tracking methods mainly focus on learning better target +representation or developing more robust prediction models to improve tracking +performance. While tracking performance has significantly improved, the target +loss issue occurs frequently due to tracking failures, complete occlusion, or +out-of-view situations. However, considerably less attention is paid to the +self-recovery issue of tracking methods, which is crucial for practical +applications. To this end, we propose a recoverable tracking framework, +RTracker, that uses a tree-structured memory to dynamically associate a tracker +and a detector to enable self-recovery ability. Specifically, we propose a +Positive-Negative Tree-structured memory to chronologically store and maintain +positive and negative target samples. Upon the PN tree memory, we develop +corresponding walking rules for determining the state of the target and define +a set of control flows to unite the tracker and the detector in different +tracking scenarios. Our core idea is to use the support samples of positive and +negative target categories to establish a relative distance-based criterion for +a reliable assessment of target loss. The favorable performance in comparison +against the state-of-the-art methods on numerous challenging benchmarks +demonstrates the effectiveness of the proposed algorithm.",cs.CV,['cs.CV'] +Supervised Anomaly Detection for Complex Industrial Images,Aimira Baitieva · David Hurych · Victor Besnier · Olivier BERNARD, ,https://arxiv.org/abs/2405.04953,,2405.04953.pdf,Supervised Anomaly Detection for Complex Industrial Images,"Automating visual inspection in industrial production lines is essential for +increasing product quality across various industries. Anomaly detection (AD) +methods serve as robust tools for this purpose. However, existing public +datasets primarily consist of images without anomalies, limiting the practical +application of AD methods in production settings. To address this challenge, we +present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial +dataset comprising 5000 images, including 2000 instances of challenging real +defects across more than 20 subclasses. Acknowledging that traditional AD +methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly +Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation +maps to compute local statistics. Next, SegAD uses these statistics and an +optional supervised classifier score as input features for a Boosted Random +Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves +state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset +(+0.4% AUROC). 
The code and the models are publicly available.",cs.CV,"['cs.CV', 'cs.LG']" +InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization,Xiefan Guo · Jinlin Liu · Miaomiao Cui · Jiankai Li · Hongyu Yang · Di Huang, ,https://arxiv.org/abs/2404.04650,,2404.04650.pdf,InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization,"Recent strides in the development of diffusion models, exemplified by +advancements such as Stable Diffusion, have underscored their remarkable +prowess in generating visually compelling images. However, the imperative of +achieving a seamless alignment between the generated image and the provided +prompt persists as a formidable challenge. This paper traces the root of these +difficulties to invalid initial noise, and proposes a solution in the form of +Initial Noise Optimization (InitNO), a paradigm that refines this noise. +Considering text prompts, not all random noises are effective in synthesizing +semantically-faithful images. We design the cross-attention response score and +the self-attention conflict score to evaluate the initial noise, bifurcating +the initial latent space into valid and invalid sectors. A strategically +crafted noise optimization pipeline is developed to guide the initial noise +towards valid regions. Our method, validated through rigorous experimentation, +shows a commendable proficiency in generating images in strict accordance with +text prompts. Our code is available at https://github.com/xiefan-guo/initno.",cs.CV,['cs.CV'] +MFP: Making Full use of Probability Maps for Interactive Image Segmentation,Chaewon Lee · Seon-Ho Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.18448,,2404.18448.pdf,MFP: Making Full Use of Probability Maps for Interactive Image Segmentation,"In recent interactive segmentation algorithms, previous probability maps are +used as network input to help predictions in the current segmentation round. +However, despite the utilization of previous masks, useful information +contained in the probability maps is not well propagated to the current +predictions. In this paper, to overcome this limitation, we propose a novel and +effective algorithm for click-based interactive image segmentation, called MFP, +which attempts to make full use of probability maps. We first modulate previous +probability maps to enhance their representations of user-specified objects. +Then, we feed the modulated probability maps as additional input to the +segmentation network. We implement the proposed MFP algorithm based on the +ResNet-34, HRNet-18, and ViT-B backbones and assess the performance extensively +on various datasets. It is demonstrated that MFP meaningfully outperforms the +existing algorithms using identical backbones. The source codes are available +at https://github.com/cwlee00/MFP.",cs.CV,['cs.CV'] +A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling,Wentao Qu · Yuantian Shao · Lingwu Meng · Xiaoshui Huang · Liang Xiao, ,https://arxiv.org/abs/2312.02719,,2312.02719.pdf,A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling,"Point cloud upsampling (PCU) enriches the representation of raw point clouds, +significantly improving the performance in downstream tasks such as +classification and reconstruction. Most of the existing point cloud upsampling +methods focus on sparse point cloud feature extraction and upsampling module +design. 
In a different way, we dive deeper into directly modelling the gradient +of data distribution from dense point clouds. In this paper, we proposed a +conditional denoising diffusion probability model (DDPM) for point cloud +upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a +condition, and iteratively learns the transformation relationship between the +dense point cloud and the noise. Simultaneously, PUDM aligns with a dual +mapping paradigm to further improve the discernment of point features. In this +context, PUDM enables learning complex geometry details in the ground truth +through the dominant features, while avoiding an additional upsampling module +design. Furthermore, to generate high-quality arbitrary-scale point clouds +during inference, PUDM exploits the prior knowledge of the scale between sparse +point clouds and dense point clouds during training by parameterizing a rate +factor. Moreover, PUDM exhibits strong noise robustness in experimental +results. In the quantitative and qualitative evaluations on PU1K and PUGAN, +PUDM significantly outperformed existing methods in terms of Chamfer Distance +(CD) and Hausdorff Distance (HD), achieving state of the art (SOTA) +performance.",cs.CV,['cs.CV'] +Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment,Aobo Li · Jinjian Wu · Yongxu Liu · Leida Li, ,https://arxiv.org/abs/2405.04167,,2405.04167.pdf,Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment,"The annotation of blind image quality assessment (BIQA) is labor-intensive +and time-consuming, especially for authentic images. Training on synthetic data +is expected to be beneficial, but synthetically trained models often suffer +from poor generalization in real domains due to domain gaps. In this work, we +make a key observation that introducing more distortion types in the synthetic +dataset may not improve or even be harmful to generalizing authentic image +quality assessment. To solve this challenge, we propose distortion-guided +unsupervised domain adaptation for BIQA (DGQA), a novel framework that +leverages adaptive multi-domain selection via prior knowledge from distortion +to match the data distribution between the source domains and the target +domain, thereby reducing negative transfer from the outlier source domains. +Extensive experiments on two cross-domain settings (synthetic distortion to +authentic distortion and synthetic distortion to algorithmic distortion) have +demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is +orthogonal to existing model-based BIQA methods, and can be used in combination +with such models to improve performance with less training data.",cs.CV,"['cs.CV', 'eess.IV']" +Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters,Jiazuo Yu · Yunzhi Zhuge · Lu Zhang · Ping Hu · Dong Wang · Huchuan Lu · You He, ,https://arxiv.org/abs/2403.11549,,2403.11549.pdf,Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters,"Continual learning can empower vision-language models to continuously acquire +new knowledge, without the need for access to the entire historical dataset. +However, mitigating the performance degradation in large-scale models is +non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) +significant computational burdens associated with full-model tuning. 
In this +work, we present a parameter-efficient continual learning framework to +alleviate long-term forgetting in incremental learning with vision-language +models. Our approach involves the dynamic expansion of a pre-trained CLIP +model, through the integration of Mixture-of-Experts (MoE) adapters in response +to new tasks. To preserve the zero-shot recognition capability of +vision-language models, we further introduce a Distribution Discriminative +Auto-Selector (DDAS) that automatically routes in-distribution and +out-of-distribution inputs to the MoE Adapter and the original CLIP, +respectively. Through extensive experiments across various settings, our +proposed method consistently outperforms previous state-of-the-art approaches +while concurrently reducing parameter training burdens by 60%. Our code locates +at https://github.com/JiazuoYu/MoE-Adapters4CL",cs.CV,['cs.CV'] +Unsupervised Blind Image Deblurring Based on Self-Enhancement,Lufei Chen · Xiangpeng Tian · Shuhua Xiong · Yinjie Lei · Chao Ren, ,,https://dl.acm.org/doi/abs/10.1145/3581783.3612535,,,,,nan +UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity,Jialong Zuo · Hanyu Zhou · Ying Nie · Feng Zhang · Tianyu Guo · Nong Sang · Yunhe Wang · Changxin Gao, ,https://arxiv.org/abs/2312.03441v4,,2312.03441v4.pdf,UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity,"Existing text-based person retrieval datasets often have relatively +coarse-grained text annotations. This hinders the model to comprehend the +fine-grained semantics of query texts in real scenarios. To address this +problem, we contribute a new benchmark named \textbf{UFineBench} for text-based +person retrieval with ultra-fine granularity. + Firstly, we construct a new \textbf{dataset} named UFine6926. We collect a +large number of person images and manually annotate each image with two +detailed textual descriptions, averaging 80.8 words each. The average word +count is three to four times that of the previous datasets. In addition of +standard in-domain evaluation, we also propose a special \textbf{evaluation +paradigm} more representative of real scenarios. It contains a new evaluation +set with cross domains, cross textual granularity and cross textual styles, +named UFine3C, and a new evaluation metric for accurately measuring retrieval +ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a +more efficient \textbf{algorithm} especially designed for text-based person +retrieval with ultra fine-grained texts. It achieves fine granularity mining by +adopting a shared cross-modal granularity decoder and hard negative match +mechanism. + With standard in-domain evaluation, CFAM establishes competitive performance +across various datasets, especially on our ultra fine-grained UFine6926. +Furthermore, by evaluating on UFine3C, we demonstrate that training on our +UFine6926 significantly improves generalization to real scenarios compared with +other coarse-grained datasets. 
The dataset and code will be made publicly +available at \url{https://github.com/Zplusdragon/UFineBench}.",cs.CV,['cs.CV'] +Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency,Xu Yingjie · Bangzhen Liu · Hao Tang · Bailin Deng · Shengfeng He, ,https://arxiv.org/abs/2403.17638v1,,2403.17638v1.pdf,Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency,"We propose a voxel-based optimization framework, ReVoRF, for few-shot +radiance fields that strategically address the unreliability in pseudo novel +view synthesis. Our method pivots on the insight that relative depth +relationships within neighboring regions are more reliable than the absolute +color values in disoccluded areas. Consequently, we devise a bilateral +geometric consistency loss that carefully navigates the trade-off between color +fidelity and geometric accuracy in the context of depth consistency for +uncertain regions. Moreover, we present a reliability-guided learning strategy +to discern and utilize the variable quality across synthesized views, +complemented by a reliability-aware voxel smoothing algorithm that smoothens +the transition between reliable and unreliable data patches. Our approach +allows for a more nuanced use of all available data, promoting enhanced +learning from regions previously considered unsuitable for high-quality +reconstruction. Extensive experiments across diverse datasets reveal that our +approach attains significant gains in efficiency and accuracy, delivering +rendering speeds of 3 FPS, 7 mins to train a $360^\circ$ scene, and a 5\% +improvement in PSNR over existing few-shot methods. Code is available at +https://github.com/HKCLynn/ReVoRF.",cs.CV,['cs.CV'] +Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes,Takashi Otonari · Satoshi Ikehata · Kiyoharu Aizawa, ,https://arxiv.org/abs/2403.16141,,2403.16141.pdf,Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes,"Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic +scenes often involve explicit modeling of scene dynamics. However, this +approach faces challenges in modeling scene dynamics in urban environments, +where moving objects of various categories and scales are present. In such +settings, it becomes crucial to effectively eliminate moving objects to +accurately reconstruct static backgrounds. Our research introduces an +innovative method, termed here as Entity-NeRF, which combines the strengths of +knowledge-based and statistical strategies. This approach utilizes entity-wise +statistics, leveraging entity segmentation and stationary entity classification +through thing/stuff segmentation. To assess our methodology, we created an +urban scene dataset masked with moving objects. Our comprehensive experiments +demonstrate that Entity-NeRF notably outperforms existing techniques in +removing moving objects and reconstructing static urban backgrounds, both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation,Yuqi Wang · Yuntao Chen · Xingyu Liao · Lue Fan · Zhaoxiang Zhang, ,https://arxiv.org/abs/2306.10013,,2306.10013.pdf,PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation,"Comprehensive modeling of the surrounding 3D world is key to the success of +autonomous driving. 
However, existing perception tasks like object detection, +road structure segmentation, depth & elevation estimation, and open-set object +localization each only focus on a small facet of the holistic 3D scene +understanding task. This divide-and-conquer strategy simplifies the algorithm +development procedure at the cost of losing an end-to-end unified solution to +the problem. In this work, we address this limitation by studying camera-based +3D panoptic segmentation, aiming to achieve a unified occupancy representation +for camera-only 3D scene understanding. To achieve this, we introduce a novel +method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal +information from multi-frame and multi-view images in a coarse-to-fine scheme, +integrating feature learning and scene representation into a unified occupancy +representation. We have conducted extensive ablation studies to verify the +effectiveness and efficiency of the proposed method. Our approach achieves new +state-of-the-art results for camera-based semantic segmentation and panoptic +segmentation on the nuScenes dataset. Furthermore, our method can be easily +extended to dense occupancy prediction and has shown promising performance on +the Occ3D benchmark. The code will be released at +https://github.com/Robertwyq/PanoOcc.",cs.CV,"['cs.CV', 'cs.RO']" +HIT: Estimating Internal Human Implicit Tissues from the Body Surface,Marilyn Keller · Vaibhav ARORA · Abdelmouttaleb Dakri · Shivam Chandhok · Jürgen Machann · Andreas Fritsche · Michael J. Black · Sergi Pujades,https://hit.is.tue.mpg.de,,https://www.youtube.com/watch?v=3u4emFF3DcE,,,,,nan +Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation,Renshuai Liu · Bowen Ma · Wei Zhang · Zhipeng Hu · Changjie Fan · Tangjie Lv · Yu Ding · Xuan Cheng, ,https://arxiv.org/abs/2401.01207,,2401.01207.pdf,Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation,"In human-centric content generation, the pre-trained text-to-image models +struggle to produce user-wanted portrait images, which retain the identity of +individuals while exhibiting diverse expressions. This paper introduces our +efforts towards personalized face generation. To this end, we propose a novel +multi-modal face generation framework, capable of simultaneous +identity-expression control and more fine-grained expression synthesis. Our +expression control is so sophisticated that it can be specialized by the +fine-grained emotional vocabulary. We devise a novel diffusion model that can +undertake the task of simultaneously face swapping and reenactment. Due to the +entanglement of identity and expression, it's nontrivial to separately and +precisely control them in one framework, thus has not been explored yet. To +overcome this, we propose several innovative designs in the conditional +diffusion model, including balancing identity and expression encoder, improved +midpoint sampling, and explicitly background conditioning. 
Extensive +experiments have demonstrated the controllability and scalability of the +proposed framework, in comparison with state-of-the-art text-to-image, face +swapping, and face reenactment methods.",cs.CV,['cs.CV'] +3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation,Songchun Zhang · Yibo Zhang · Quan Zheng · Rui Ma · Wei Hua · Hujun Bao · Weiwei Xu · Changqing Zou, ,https://arxiv.org/abs/2403.09439,,2403.09439.pdf,3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation,"Text-driven 3D scene generation techniques have made rapid progress in recent +years. Their success is mainly attributed to using existing generative models +to iteratively perform image warping and inpainting to generate 3D scenes. +However, these methods heavily rely on the outputs of existing models, leading +to error accumulation in geometry and appearance that prevent the models from +being used in various scenarios (e.g., outdoor and unreal scenarios). To +address this limitation, we generatively refine the newly generated local views +by querying and aggregating global 3D information, and then progressively +generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF +as a unified representation of the 3D scene to constrain global 3D consistency, +and propose a generative refinement network to synthesize new contents with +higher quality by exploiting the natural image prior from 2D diffusion model as +well as the global 3D information of the current scene. Our extensive +experiments demonstrate that, in comparison to previous methods, our approach +supports wide variety of scene generation and arbitrary camera trajectories +with improved visual quality and 3D consistency.",cs.CV,"['cs.CV', 'cs.AI']" +Accurate Spatial Gene Expression Prediction by Integrating Multi-Resolution Features,Youngmin Chung · Ji Hun Ha · Kyeong Chan Im · Joo Sang Lee, ,https://arxiv.org/abs/2403.07592v1,,2403.07592v1.pdf,Accurate Spatial Gene Expression Prediction by integrating Multi-resolution features,"Recent advancements in Spatial Transcriptomics (ST) technology have +facilitated detailed gene expression analysis within tissue contexts. However, +the high costs and methodological limitations of ST necessitate a more robust +predictive model. In response, this paper introduces TRIPLEX, a novel deep +learning framework designed to predict spatial gene expression from Whole Slide +Images (WSIs). TRIPLEX uniquely harnesses multi-resolution features, capturing +cellular morphology at individual spots, the local context around these spots, +and the global tissue organization. By integrating these features through an +effective fusion strategy, TRIPLEX achieves accurate gene expression +prediction. Our comprehensive benchmark study, conducted on three public ST +datasets and supplemented with Visium data from 10X Genomics, demonstrates that +TRIPLEX outperforms current state-of-the-art models in Mean Squared Error +(MSE), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (PCC). 
+The model's predictions align closely with ground truth gene expression +profiles and tumor annotations, underscoring TRIPLEX's potential in advancing +cancer diagnosis and treatment.",cs.CV,['cs.CV'] +Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM,Pingping Zhang · Tianyu Yan · Yang Liu · Huchuan Lu, ,https://arxiv.org/abs/2404.04996,,2404.04996.pdf,Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM,"As an important pillar of underwater intelligence, Marine Animal Segmentation +(MAS) involves segmenting animals within marine environments. Previous methods +don't excel in extracting long-range contextual features and overlook the +connectivity between discrete pixels. Recently, Segment Anything Model (SAM) +offers a universal framework for general segmentation tasks. Unfortunately, +trained with natural images, SAM does not obtain the prior knowledge from +marine images. In addition, the single-position prompt of SAM is very +insufficient for prior guidance. To address these issues, we propose a novel +feature learning framework, named Dual-SAM for high-performance MAS. To this +end, we first introduce a dual structure with SAM's paradigm to enhance feature +learning of marine images. Then, we propose a Multi-level Coupled Prompt (MCP) +strategy to instruct comprehensive underwater prior information, and enhance +the multi-level features of SAM's encoder with adapters. Subsequently, we +design a Dilated Fusion Attention Module (DFAM) to progressively integrate +multi-level features from SAM's encoder. Finally, instead of directly +predicting the masks of marine animals, we propose a Criss-Cross Connectivity +Prediction (C$^3$P) paradigm to capture the inter-connectivity between discrete +pixels. With dual decoders, it generates pseudo-labels and achieves mutual +supervision for complementary feature representations, resulting in +considerable improvements over previous techniques. Extensive experiments +verify that our proposed method achieves state-of-the-art performances on five +widely-used MAS datasets. The code is available at +https://github.com/Drchip61/Dual_SAM.",cs.CV,"['cs.CV', 'cs.MM']" +Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans,Romain Loiseau · Elliot Vincent · Mathieu Aubry · Loic Landrieu,https://romainloiseau.fr/learnable-earth-parser/,,https://www.youtube.com/watch?v=0PkxeT17e8Q,,,,,nan +Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications,Yuwen Xiong · Zhiqi Li · Yuntao Chen · Feng Wang · Xizhou Zhu · Jiapeng Luo · Wenhai Wang · Tong Lu · Hongsheng Li · Yu Qiao · Lewei Lu · Jie Zhou · Jifeng Dai, ,https://arxiv.org/abs/2401.06197,,2401.06197.pdf,Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications,"We introduce Deformable Convolution v4 (DCNv4), a highly efficient and +effective operator designed for a broad spectrum of vision applications. DCNv4 +addresses the limitations of its predecessor, DCNv3, with two key enhancements: +1. removing softmax normalization in spatial aggregation to enhance its dynamic +property and expressive power and 2. optimizing memory access to minimize +redundant operations for speedup. These improvements result in a significantly +faster convergence compared to DCNv3 and a substantial increase in processing +speed, with DCNv4 achieving more than three times the forward speed. 
DCNv4 +demonstrates exceptional performance across various tasks, including image +classification, instance and semantic segmentation, and notably, image +generation. When integrated into generative models like U-Net in the latent +diffusion model, DCNv4 outperforms its baseline, underscoring its possibility +to enhance generative models. In practical applications, replacing DCNv3 with +DCNv4 in the InternImage model to create FlashInternImage results in up to 80% +speed increase and further performance improvement without further +modifications. The advancements in speed and efficiency of DCNv4, combined with +its robust performance across diverse vision tasks, show its potential as a +foundational building block for future vision models.",cs.CV,['cs.CV'] +A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization,Hongwei Ren · Jiadong Zhu · Yue Zhou · Haotian FU · Yulong Huang · Bojun Cheng, ,https://arxiv.org/abs/2403.19412,,2403.19412.pdf,A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization,"Event cameras exhibit remarkable attributes such as high dynamic range, +asynchronicity, and low latency, making them highly suitable for vision tasks +that involve high-speed motion in challenging lighting conditions. These +cameras implicitly capture movement and depth information in events, making +them appealing sensors for Camera Pose Relocalization (CPR) tasks. +Nevertheless, existing CPR networks based on events neglect the pivotal +fine-grained temporal information in events, resulting in unsatisfactory +performance. Moreover, the energy-efficient features are further compromised by +the use of excessively complex models, hindering efficient deployment on edge +devices. In this paper, we introduce PEPNet, a simple and effective point-based +network designed to regress six degrees of freedom (6-DOFs) event camera poses. +We rethink the relationship between the event camera and CPR tasks, leveraging +the raw Point Cloud directly as network input to harness the high-temporal +resolution and inherent sparsity of events. PEPNet is adept at abstracting the +spatial and implicit temporal features through hierarchical structure and +explicit temporal features by Attentive Bi-directional Long Short-Term Memory +(A-Bi-LSTM). By employing a carefully crafted lightweight design, PEPNet +delivers state-of-the-art (SOTA) performance on both indoor and outdoor +datasets with meager computational resources. Specifically, PEPNet attains a +significant 38% and 33% performance improvement on the random split IJRR and +M3ED datasets, respectively. Moreover, the lightweight design version +PEPNet$_{tiny}$ accomplishes results comparable to the SOTA while employing a +mere 0.5% of the parameters.",cs.CV,['cs.CV'] +Attention Calibration for Disentangled Text-to-Image Personalization,Yanbing Zhang · Mengping Yang · Qin Zhou · Zhe Wang, ,https://arxiv.org/abs/2403.18551,,2403.18551.pdf,Attention Calibration for Disentangled Text-to-Image Personalization,"Recent thrilling progress in large-scale text-to-image (T2I) models has +unlocked unprecedented synthesis quality of AI-generated content (AIGC) +including image generation, 3D and video composition. Further, personalized +techniques enable appealing customized production of a novel concept given only +several images as reference. However, an intriguing problem persists: Is it +possible to capture multiple, novel concepts from one single reference image? 
+In this paper, we identify that existing approaches fail to preserve visual +consistency with the reference image and eliminate cross-influence from +concepts. To alleviate this, we propose an attention calibration mechanism to +improve the concept-level understanding of the T2I model. Specifically, we +first introduce new learnable modifiers bound with classes to capture +attributes of multiple concepts. Then, the classes are separated and +strengthened following the activation of the cross-attention operation, +ensuring comprehensive and self-contained concepts. Additionally, we suppress +the attention activation of different classes to mitigate mutual influence +among concepts. Together, our proposed method, dubbed DisenDiff, can learn +disentangled multiple concepts from one single image and produce novel +customized images with learned concepts. We demonstrate that our method +outperforms the current state of the art in both qualitative and quantitative +evaluations. More importantly, our proposed techniques are compatible with LoRA +and inpainting pipelines, enabling more interactive experiences.",cs.CV,['cs.CV'] +LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching,Yixun Liang · Xin Yang · Jiantao Lin · Haodong LI · Xiaogang Xu · Ying-Cong Chen, ,https://arxiv.org/abs/2311.11284,,2311.11284.pdf,LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching,"The recent advancements in text-to-3D generation mark a significant milestone +in generative models, unlocking new possibilities for creating imaginative 3D +assets across various real-world scenarios. While recent advancements in +text-to-3D generation have shown promise, they often fall short in rendering +detailed and high-quality 3D models. This problem is especially prevalent as +many methods base themselves on Score Distillation Sampling (SDS). This paper +identifies a notable deficiency in SDS, that it brings inconsistent and +low-quality updating direction for the 3D model, causing the over-smoothing +effect. To address this, we propose a novel approach called Interval Score +Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes +interval-based score matching to counteract over-smoothing. Furthermore, we +incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. +Extensive experiments show that our model largely outperforms the +state-of-the-art in quality and training efficiency.",cs.CV,"['cs.CV', 'cs.GR', 'cs.MM']" +Object Dynamics Modeling with Hierarchical Point Cloud-based Representations,Chanho Kim · Li Fuxin, ,https://arxiv.org/abs/2404.06044,,2404.06044.pdf,Object Dynamics Modeling with Hierarchical Point Cloud-based Representations,"Modeling object dynamics with a neural network is an important problem with +numerous applications. Most recent work has been based on graph neural +networks. However, physics happens in 3D space, where geometric information +potentially plays an important role in modeling physical phenomena. In this +work, we propose a novel U-net architecture based on continuous point +convolution which naturally embeds information from 3D coordinates and allows +for multi-scale feature representations with established downsampling and +upsampling procedures. Bottleneck layers in the downsampled point clouds lead +to better long-range interaction modeling. 
Besides, the flexibility of point +convolutions allows our approach to generalize to sparsely sampled points from +mesh vertices and dynamically generate features on important interaction points +on mesh faces. Experimental results demonstrate that our approach significantly +improves the state-of-the-art, especially in scenarios that require accurate +gravity or collision reasoning.",cs.CV,['cs.CV'] +CAD: Photorealistic 3D Generation via Adversarial Distillation,Ziyu Wan · Despoina Paschalidou · Ian Huang · Hongyu Liu · Bokui Shen · Xiaoyu Xiang · Jing Liao · Leonidas Guibas,http://raywzy.com/CAD/,https://arxiv.org/abs/2312.06663,,2312.06663.pdf,CAD: Photorealistic 3D Generation via Adversarial Distillation,"The increased demand for 3D data in AR/VR, robotics and gaming applications, +gave rise to powerful generative pipelines capable of synthesizing high-quality +3D objects. Most of these models rely on the Score Distillation Sampling (SDS) +algorithm to optimize a 3D representation such that the rendered image +maintains a high likelihood as evaluated by a pre-trained diffusion model. +However, finding a correct mode in the high-dimensional distribution produced +by the diffusion model is challenging and often leads to issues such as +over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we +propose a novel learning paradigm for 3D synthesis that utilizes pre-trained +diffusion models. Instead of focusing on mode-seeking, our method directly +models the distribution discrepancy between multi-view renderings and diffusion +priors in an adversarial manner, which unlocks the generation of high-fidelity +and photorealistic 3D content, conditioned on a single image and prompt. +Moreover, by harnessing the latent space of GANs and expressive diffusion model +priors, our method facilitates a wide variety of 3D applications including +single-view reconstruction, high diversity generation and continuous 3D +interpolation in the open domain. The experiments demonstrate the superiority +of our pipeline compared to previous works in terms of generation quality and +diversity.",cs.CV,"['cs.CV', 'cs.GR']" +Gaussian Shell Maps for Efficient 3D Human Generation,Rameen Abdal · Wang Yifan · Zifan Shi · Yinghao Xu · Ryan Po · Zhengfei Kuang · Qifeng Chen · Dit-Yan Yeung · Gordon Wetzstein,https://rameenabdal.github.io/GaussianShellMaps/,https://arxiv.org/abs/2311.17857v1,,2311.17857v1.pdf,Gaussian Shell Maps for Efficient 3D Human Generation,"Efficient generation of 3D digital humans is important in several industries, +including virtual reality, social media, and cinematic production. 3D +generative adversarial networks (GANs) have demonstrated state-of-the-art +(SOTA) quality and diversity for generated assets. Current 3D GAN +architectures, however, typically rely on volume representations, which are +slow to render, thereby hampering the GAN training and requiring +multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps +(GSMs) as a framework that connects SOTA generator network architectures with +emerging 3D Gaussian rendering primitives using an articulable multi +shell--based scaffold. In this setting, a CNN generates a 3D texture stack with +features that are mapped to the shells. The latter represent inflated and +deflated versions of a template surface of a digital human in a canonical body +pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the +shells whose attributes are encoded in the texture features. 
These Gaussians +are efficiently and differentiably rendered. The ability to articulate the +shells is important during GAN training and, at inference time, to deform a +body into arbitrary user-defined poses. Our efficient rendering scheme bypasses +the need for view-inconsistent upsamplers and achieves high-quality multi-view +consistent renderings at a native resolution of $512 \times 512$ pixels. We +demonstrate that GSMs successfully generate 3D humans when trained on +single-view datasets, including SHHQ and DeepFashion.",cs.CV,"['cs.CV', 'cs.GR']" +3D-Aware Face Editing via Warping-Guided Latent Direction Learning,Yuhao Cheng · Zhuo Chen · Xingyu Ren · Wenhan Zhu · Zhengqin Xu · Di Xu · Yang Changpeng · Yichao Yan, ,https://arxiv.org/abs/2402.14000,,2402.14000.pdf,Real-time 3D-aware Portrait Editing from a Single Image,"This work presents 3DPE, a practical method that can efficiently edit a face +image following given prompts, like reference images or text descriptions, in a +3D-aware manner. To this end, a lightweight module is distilled from a 3D +portrait generator and a text-to-image model, which provide prior knowledge of +face geometry and superior editing capability, respectively. Such a design +brings two compelling advantages over existing approaches. First, our system +achieves real-time editing with a feedforward network (i.e., ~0.04s per image), +over 100x faster than the second competitor. Second, thanks to the powerful +priors, our module could focus on the learning of editing-related variations, +such that it manages to handle various types of editing simultaneously in the +training phase and further supports fast adaptation to user-specified +customized types of editing during inference (e.g., with ~5min fine-tuning per +style). The code, the model, and the interface will be made publicly available +to facilitate future research.",cs.CV,['cs.CV'] +NeRFiller: Completing Scenes via Generative 3D Inpainting,Ethan Weber · Aleksander Holynski · Varun Jampani · Saurabh Saxena · Noah Snavely · Abhishek Kar · Angjoo Kanazawa,https://ethanweber.me/nerfiller/,https://arxiv.org/abs/2312.04560,,2312.04560.pdf,NeRFiller: Completing Scenes via Generative 3D Inpainting,"We propose NeRFiller, an approach that completes missing portions of a 3D +capture via generative 3D inpainting using off-the-shelf 2D visual generative +models. Often parts of a captured 3D scene or object are missing due to mesh +reconstruction failures or a lack of observations (e.g., contact regions, such +as the bottom of objects, or hard-to-reach areas). We approach this challenging +3D inpainting problem by leveraging a 2D inpainting diffusion model. We +identify a surprising behavior of these models, where they generate more 3D +consistent inpaints when images form a 2$\times$2 grid, and show how to +generalize this behavior to more than four images. We then present an iterative +framework to distill these inpainted regions into a single consistent 3D scene. +In contrast to related works, we focus on completing scenes rather than +deleting foreground objects, and our approach does not require tight 2D object +masks or text. We compare our approach to relevant baselines adapted to our +setting on a variety of scenes, where NeRFiller creates the most 3D consistent +and plausible scene completions. 
Our project page is at +https://ethanweber.me/nerfiller.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,Yichi Zhang · Yinpeng Dong · Siyuan Zhang · Tianzan Min · Hang Su · Jun Zhu, ,https://arxiv.org/abs/2404.11207v1,,2404.11207v1.pdf,Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,"Although Multimodal Large Language Models (MLLMs) have demonstrated promising +versatile capabilities, their performance is still inferior to specialized +models on downstream tasks, which makes adaptation necessary to enhance their +utility. However, fine-tuning methods require independent training for every +model, leading to huge computation and memory overheads. In this paper, we +propose a novel setting where we aim to improve the performance of diverse +MLLMs with a group of shared parameters optimized for a downstream task. To +achieve this, we propose Transferable Visual Prompting (TVP), a simple and +effective approach to generate visual prompts that can transfer to different +models and improve their performance on downstream tasks after trained on only +one model. We introduce two strategies to address the issue of cross-model +feature corruption of existing visual prompting methods and enhance the +transferability of the learned prompts, including 1) Feature Consistency +Alignment: which imposes constraints to the prompted feature changes to +maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which +encourages the prompted images to contain richer task-specific semantics with +language guidance. We validate the effectiveness of TVP through extensive +experiments with 6 modern MLLMs on a wide variety of tasks ranging from object +recognition and counting to multimodal reasoning and hallucination correction.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Utility-Fairness Trade-Offs and How to Find Them,Sepehr Dehdashtian · Bashir Sadeghi · Vishnu Naresh Boddeti,https://sepehrdehdashtian.github.io/Papers/U-FaTE/index.html,https://arxiv.org/abs/2404.09454v1,,2404.09454v1.pdf,Utility-Fairness Trade-Offs and How to Find Them,"When building classification systems with demographic fairness +considerations, there are two objectives to satisfy: 1) maximizing utility for +the specific task and 2) ensuring fairness w.r.t. a known demographic +attribute. These objectives often compete, so optimizing both can lead to a +trade-off between utility and fairness. While existing works acknowledge the +trade-offs and study their limits, two questions remain unanswered: 1) What are +the optimal trade-offs between utility and fairness? and 2) How can we +numerically quantify these trade-offs from data for a desired prediction task +and demographic attribute of interest? This paper addresses these questions. We +introduce two utility-fairness trade-offs: the Data-Space and Label-Space +Trade-off. The trade-offs reveal three regions within the utility-fairness +plane, delineating what is fully and partially possible and impossible. We +propose U-FaTE, a method to numerically quantify the trade-offs for a given +prediction task and group fairness definition from data samples. Based on the +trade-offs, we introduce a new scheme for evaluating representations. 
An +extensive evaluation of fair representation learning methods and +representations from over 1000 pre-trained models revealed that most current +approaches are far from the estimated and achievable fairness-utility +trade-offs across multiple datasets and prediction tasks.",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +Observation-Guided Diffusion Probabilistic Models,Junoh Kang · Jinyoung Choi · Sungik Choi · Bohyung Han, ,https://arxiv.org/abs/2310.04041,,2310.04041.pdf,Observation-Guided Diffusion Probabilistic Models,"We propose a novel diffusion-based image generation method called the +observation-guided diffusion probabilistic model (OGDM), which effectively +addresses the tradeoff between quality control and fast sampling. Our approach +reestablishes the training objective by integrating the guidance of the +observation process with the Markov chain in a principled way. This is achieved +by introducing an additional loss term derived from the observation based on a +conditional discriminator on noise level, which employs a Bernoulli +distribution indicating whether its input lies on the (noisy) real manifold or +not. This strategy allows us to optimize the more accurate negative +log-likelihood induced in the inference stage especially when the number of +function evaluations is limited. The proposed training scheme is also +advantageous even when incorporated only into the fine-tuning process, and it +is compatible with various fast inference strategies since our method yields +better denoising networks using the exactly the same inference procedure +without incurring extra computational cost. We demonstrate the effectiveness of +our training algorithm using diverse inference techniques on strong diffusion +model baselines. Our implementation is available at +https://github.com/Junoh-Kang/OGDM_edm.",cs.LG,"['cs.LG', 'cs.AI']" +FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition,Ganggui Ding · Canyu Zhao · Wen Wang · Zhen Yang · Zide Liu · Hao Chen · Chunhua Shen,https://aim-uofa.github.io/FreeCustom/,https://arxiv.org/abs/2405.13870,,2405.13870.pdf,FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition,"Benefiting from large-scale pre-trained text-to-image (T2I) generative +models, impressive progress has been achieved in customized image generation, +which aims to generate user-specified concepts. Existing approaches have +extensively focused on single-concept customization and still encounter +challenges when it comes to complex scenarios that involve combining multiple +concepts. These approaches often require retraining/fine-tuning using a few +images, leading to time-consuming training processes and impeding their swift +implementation. Furthermore, the reliance on multiple images to represent a +singular concept increases the difficulty of customization. To this end, we +propose FreeCustom, a novel tuning-free method to generate customized images of +multi-concept composition based on reference concepts, using only one image per +concept as input. Specifically, we introduce a new multi-reference +self-attention (MRSA) mechanism and a weighted mask strategy that enables the +generated image to access and focus more on the reference concepts. In +addition, MRSA leverages our key finding that input concepts are better +preserved when providing images with context interactions. Experiments show +that our method's produced images are consistent with the given concepts and +better aligned with the input text. 
Our method outperforms or performs on par +with other training-based methods in terms of multi-concept composition and +single-concept customization, but is simpler. Codes can be found at +https://github.com/aim-uofa/FreeCustom.",cs.CV,['cs.CV'] +ModaVerse: Efficiently Transforming Modalities with LLMs,Xinyu Wang · Bohan Zhuang · Qi Wu, ,https://arxiv.org/abs/2401.06395,,2401.06395.pdf,ModaVerse: Efficiently Transforming Modalities with LLMs,"Humans possess the capability to comprehend diverse modalities and seamlessly +transfer information between them. In this work, we introduce ModaVerse, a +Multi-modal Large Language Model (MLLM) capable of comprehending and +transforming content across various modalities including images, videos, and +audio. Predominant MLLM frameworks have largely relied on the alignment of +latent spaces of textual and non-textual features. This alignment process, +which synchronizes a language model trained on textual data with encoders and +decoders trained on multi-modal data, often necessitates extensive training of +several projection layers in multiple stages. Inspired by LLM-as-agent +methodologies, we propose a novel Input/Output (I/O) alignment mechanism that +operates directly at the level of natural language. It aligns the LLM's output +with the input of generative models, avoiding the complexities associated with +latent feature alignments, and simplifying the multiple training stages of +existing MLLMs into a single, efficient process. This conceptual advancement +leads to significant reductions in both data and computational costs. By +conducting experiments on several benchmarks, we demonstrate that our approach +attains comparable performance with the state of the art while achieving +considerable efficiencies in data usage and training duration.",cs.CV,['cs.CV'] +Targeted Representation Alignment for Open-World Semi-Supervised Learning,Ruixuan Xiao · Lei Feng · Kai Tang · Junbo Zhao · Yixuan Li · Gang Chen · Haobo Wang, ,https://arxiv.org/abs/2311.03524,,2311.03524.pdf,A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning,"Open-world semi-supervised learning aims at inferring both known and novel +classes in unlabeled data, by harnessing prior knowledge from a labeled set +with known classes. Despite its importance, there is a lack of theoretical +foundations for this problem. This paper bridges the gap by formalizing a +graph-theoretic framework tailored for the open-world setting, where the +clustering can be theoretically characterized by graph factorization. Our +graph-theoretic framework illuminates practical algorithms and provides +guarantees. In particular, based on our graph formulation, we apply the +algorithm called Spectral Open-world Representation Learning (SORL), and show +that minimizing our loss is equivalent to performing spectral decomposition on +the graph. Such equivalence allows us to derive a provable error bound on the +clustering performance for both known and novel classes, and analyze rigorously +when labeled data helps. 
Empirically, SORL can match or outperform several +strong baselines on common benchmark datasets, which is appealing for practical +usage while enjoying theoretical guarantees.",cs.LG,['cs.LG'] +PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,Yangyang Guo · Guangzhi Wang · Mohan Kankanhalli, ,https://arxiv.org/abs/2310.10700,,2310.10700.pdf,PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,"Applying a pre-trained large model to downstream tasks is prohibitive under +resource-constrained conditions. Recent dominant approaches for addressing +efficiency issues involve adding a few learnable parameters to the fixed +backbone model. This strategy, however, leads to more challenges in loading +large models for downstream fine-tuning with limited resources. In this paper, +we propose a novel method for increasing the parameter efficiency of +pre-trained models by introducing an intermediate pre-training stage. To this +end, we first employ low-rank approximation to compress the original large +model and then devise a feature distillation module and a weight perturbation +regularization module. These modules are specifically designed to enhance the +low-rank model. In particular, we update only the low-rank model while freezing +the backbone parameters during pre-training. This allows for direct and +efficient utilization of the low-rank model for downstream fine-tuning tasks. +The proposed method achieves both efficiencies in terms of required parameters +and computation time while maintaining comparable results with minimal +modifications to the backbone architecture. Specifically, when applied to three +vision-only and one vision-language Transformer models, our approach often +demonstrates a merely $\sim$0.6 point decrease in performance while reducing +the original parameter size by 1/3 to 2/3.",cs.CV,['cs.CV'] +Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning,Shiming Chen · Wenjin Hou · Salman Khan · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2404.07713,,2404.07713.pdf,Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning,"Zero-shot learning (ZSL) recognizes the unseen classes by conducting +visual-semantic interactions to transfer semantic knowledge from seen classes +to unseen ones, supported by semantic information (e.g., attributes). However, +existing ZSL methods simply extract visual features using a pre-trained network +backbone (i.e., CNN or ViT), which fail to learn matched visual-semantic +correspondences for representing semantic-related visual features as lacking of +the guidance of semantic information, resulting in undesirable visual-semantic +interactions. To tackle this issue, we propose a progressive semantic-guided +vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly +considers two properties in the whole network: i) discover the semantic-related +visual representations explicitly, and ii) discard the semantic-unrelated +visual information. Specifically, we first introduce semantic-embedded token +learning to improve the visual-semantic correspondences via semantic +enhancement and discover the semantic-related visual tokens explicitly with +semantic-guided token attention. Then, we fuse low semantic-visual +correspondence visual tokens to discard the semantic-unrelated visual +information for visual enhancement. 
These two operations are integrated into +various encoders to progressively learn semantic-related visual representations +for accurate visual-semantic interactions in ZSL. The extensive experiments +show that our ZSLViT achieves significant performance gains on three popular +benchmark datasets, i.e., CUB, SUN, and AWA2.",cs.CV,"['cs.CV', 'cs.LG']" +Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection,Wenjun Hui · Zhenfeng Zhu · Shuai Zheng · Yao Zhao, ,https://arxiv.org/html/2403.01968v1,,2403.01968v1.pdf,Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection,"Camouflage poses challenges in distinguishing a static target, whereas any +movement of the target can break this disguise. Existing video camouflaged +object detection (VCOD) approaches take noisy motion estimation as input or +model motion implicitly, restricting detection performance in complex dynamic +scenes. In this paper, we propose a novel Explicit Motion handling and +Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion +cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP +is characterized by a two-stream architecture for simultaneously conducting +camouflaged segmentation and optical flow estimation. Interactions across the +dual streams are realized in an interactive prompting way that is inspired by +emerging visual prompt learning. Two learnable modules, i.e. the camouflaged +feeder and motion collector, are designed to incorporate segmentation-to-motion +and motion-to-segmentation prompts, respectively, and enhance outputs of the +both streams. The prompt fed to the motion stream is learned by supervising +optical flow in a self-supervised manner. Furthermore, we show that long-term +historical information can also be incorporated as a prompt into EMIP and +achieve more robust results with temporal consistency. Experimental results +demonstrate that our EMIP achieves new state-of-the-art records on popular VCOD +benchmarks. The code will be publicly available.",cs.CV,['cs.CV'] +Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment,Zheren Fu · Lei Zhang · Hou Xia · Zhendong Mao,https://github.com/CrossmodalGroup/LAPS,https://arxiv.org/html/2312.05278v2,,2312.05278v2.pdf,Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects,"Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot +capabilities in various vision-language dialogue scenarios. However, the +absence of fine-grained visual object detection hinders the model from +understanding the details of images, leading to irreparable visual +hallucinations and factual errors. In this paper, we propose Lyrics, a novel +multi-modal pre-training and instruction fine-tuning paradigm that bootstraps +vision-language alignment from fine-grained cross-modal collaboration. Building +on the foundation of BLIP-2, Lyrics infuses local visual features extracted +from a visual refiner that includes image tagging, object detection and +semantic segmentation modules into the Querying Transformer, while on the text +side, the language inputs equip the boundary boxes and tags derived from the +visual refiner. We further introduce a two-stage training scheme, in which the +pre-training stage bridges the modality gap through explicit and comprehensive +vision-language alignment targets. 
During the instruction fine-tuning stage, we +introduce semantic-aware visual feature extraction, a crucial method that +enables the model to extract informative features from concrete visual objects. +Our approach achieves robust performance on 13 datasets across various +vision-language tasks, and demonstrates promising multi-modal understanding, +perception and conversation capabilities in 11 scenario-based benchmark +toolkits.",cs.CL,['cs.CL'] +"GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation",Zifan Wang · Junyu Chen · Ziqing Chen · Pengwei Xie · Rui Chen · Li Yi, ,https://arxiv.org/abs/2401.00929,,2401.00929.pdf,"GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation","This paper presents GenH2R, a framework for learning generalizable +vision-based human-to-robot (H2R) handover skills. The goal is to equip robots +with the ability to reliably receive objects with unseen geometry handed over +by humans in various complex trajectories. We acquire such generalizability by +learning H2R handover at scale with a comprehensive solution including +procedural simulation assets creation, automated demonstration generation, and +effective imitation learning. We leverage large-scale 3D model repositories, +dexterous grasp generation methods, and curve-based 3D animation to create an +H2R handover simulation environment named \simabbns, surpassing the number of +scenes in existing simulators by three orders of magnitude. We further +introduce a distillation-friendly demonstration generation method that +automatically generates a million high-quality demonstrations suitable for +learning. Finally, we present a 4D imitation learning method augmented by a +future forecasting objective to distill demonstrations into a visuo-motor +handover policy. Experimental evaluations in both simulators and the real world +demonstrate significant improvements (at least +10\% success rate) over +baselines in all cases. The project page is https://GenH2R.github.io/.",cs.RO,"['cs.RO', 'cs.CV']" +HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data,Mengqi Zhang · Yang Fu · Zheng Ding · Sifei Liu · Zhuowen Tu · Xiaolong Wang, ,https://arxiv.org/abs/2403.12011,,2403.12011.pdf,HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data,"3D hand-object interaction data is scarce due to the hardware constraints in +scaling up the data collection process. In this paper, we propose HOIDiffusion +for generating realistic and diverse 3D hand-object interaction data. Our model +is a conditional diffusion model that takes both the 3D hand-object geometric +structure and text description as inputs for image synthesis. This offers a +more controllable and realistic synthesis as we can specify the structure and +style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a +diffusion model pre-trained on large-scale natural images and a few 3D human +demonstrations. Beyond controllable image synthesis, we adopt the generated 3D +data for learning 6D object pose estimation and show its effectiveness in +improving perception systems. 
Project page: +https://mq-zhang1.github.io/HOIDiffusion",cs.CV,['cs.CV'] +Learning to navigate efficiently and precisely in real environments,Guillaume Bono · Hervé Poirier · Leonid Antsfeld · Gianluca Monaci · Boris Chidlovskii · Christian Wolf, ,https://arxiv.org/abs/2401.14349,,2401.14349.pdf,Learning to navigate efficiently and precisely in real environments,"In the context of autonomous navigation of terrestrial robots, the creation +of realistic models for agent dynamics and sensing is a widespread habit in the +robotics literature and in commercial applications, where they are used for +model based control and/or for localization and mapping. The more recent +Embodied AI literature, on the other hand, focuses on modular or end-to-end +agents trained in simulators like Habitat or AI-Thor, where the emphasis is put +on photo-realistic rendering and scene diversity, but high-fidelity robot +motion is assigned a less privileged role. The resulting sim2real gap +significantly impacts transfer of the trained models to real robotic platforms. +In this work we explore end-to-end training of agents in simulation in settings +which minimize the sim2real gap both, in sensing and in actuation. Our agent +directly predicts (discretized) velocity commands, which are maintained through +closed-loop control in the real robot. The behavior of the real robot +(including the underlying low-level controller) is identified and simulated in +a modified Habitat simulator. Noise models for odometry and localization +further contribute in lowering the sim2real gap. We evaluate on real navigation +scenarios, explore different localization and point goal calculation methods +and report significant gains in performance and robustness compared to prior +work.",cs.RO,"['cs.RO', 'cs.CV']" +TexOct: Generating Textures of 3D Models with Octree-based Diffusion,Jialun Liu · Chenming Wu · Xinqi Liu · Xing Liu · Jinbo Wu · Haotian Peng · Chen Zhao · Haocheng Feng · Jingtuo Liu · Errui Ding, ,https://arxiv.org/html/2403.15009v1,,2403.15009v1.pdf,TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization,"This paper presents TexRO, a novel method for generating delicate textures of +a known 3D mesh by optimizing its UV texture. The key contributions are +two-fold. We propose an optimal viewpoint selection strategy, that finds the +most miniature set of viewpoints covering all the faces of a mesh. Our +viewpoint selection strategy guarantees the completeness of a generated result. +We propose a recursive optimization pipeline that optimizes a UV texture at +increasing resolutions, with an adaptive denoising method that re-uses existing +textures for new texture generation. Through extensive experimentation, we +demonstrate the superior performance of TexRO in terms of texture quality, +detail preservation, visual consistency, and, notably runtime speed, +outperforming other current methods. The broad applicability of TexRO is +further confirmed through its successful use on diverse 3D models.",cs.CV,['cs.CV'] +Explaining CLIP's performance disparities on data from blind/low vision users,Daniela Massiceti · Camilla Longden · Agnieszka Słowik · Samuel Wills · Martin Grayson · Cecily Morrison, ,https://arxiv.org/abs/2311.17315,,2311.17315.pdf,Explaining CLIP's performance disparities on data from blind/low vision users,"Large multi-modal models (LMMs) hold the potential to usher in a new era of +automated visual assistance for people who are blind or low vision (BLV). 
Yet, +these models have not been systematically evaluated on data captured by BLV +users. We address this by empirically assessing CLIP, a widely-used LMM likely +to underpin many assistive technologies. Testing 25 CLIP variants in a +zero-shot classification task, we find that their accuracy is 15 percentage +points lower on average for images captured by BLV users than web-crawled +images. This disparity stems from CLIP's sensitivities to 1) image content +(e.g. not recognizing disability objects as well as other objects); 2) image +quality (e.g. not being robust to lighting variation); and 3) text content +(e.g. not recognizing objects described by tactile adjectives as well as visual +ones). We delve deeper with a textual analysis of three common pre-training +datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content +is rarely mentioned. We then provide three examples that illustrate how the +performance disparities extend to three downstream models underpinned by CLIP: +OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 +images can mitigate CLIP's quality-of-service disparities for BLV users in some +scenarios, which we discuss alongside a set of other possible mitigations.",cs.CV,['cs.CV'] +Virtual Immunohistochemistry Staining for Histological Images Assisted by Weakly-supervised Learning,Jiahan Li · Jiuyang Dong · Shenjin Huang · Xi Li · Junjun Jiang · Xiaopeng Fan · Yongbing Zhang, ,,https://www.sciencedirect.com/science/article/pii/S0167779924000386,,,,,nan +Bridging Remote Sensors with Multisensor Geospatial Foundation Models,Boran Han · Shuai Zhang · Xingjian Shi · Markus Reichstein, ,https://arxiv.org/abs/2404.01260,,2404.01260.pdf,Bridging Remote Sensors with Multisensor Geospatial Foundation Models,"In the realm of geospatial analysis, the diversity of remote sensors, +encompassing both optical and microwave technologies, offers a wealth of +distinct observational capabilities. Recognizing this, we present msGFM, a +multisensor geospatial foundation model that effectively unifies data from four +key sensor modalities. This integration spans an expansive dataset of two +million multisensor images. msGFM is uniquely adept at handling both paired and +unpaired sensor data. For data originating from identical geolocations, our +model employs an innovative cross-sensor pretraining approach in masked image +modeling, enabling the synthesis of joint representations from diverse sensors. +msGFM, incorporating four remote sensors, upholds strong performance, forming a +comprehensive model adaptable to various sensor types. msGFM has demonstrated +enhanced proficiency in a range of both single-sensor and multisensor +downstream tasks. These include scene classification, segmentation, cloud +removal, and pan-sharpening. A key discovery of our research is that +representations derived from natural images are not always compatible with the +distinct characteristics of geospatial remote sensors, underscoring the +limitations of existing representations in this field. 
Our work can serve as a +guide for developing multisensor geospatial pretraining models, paving the way +for more advanced geospatial capabilities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Active Generalized Category Discovery,Shijie Ma · Fei Zhu · Zhun Zhong · Xu-Yao Zhang · Cheng-Lin Liu,https://github.com/mashijie1028/ActiveGCD,https://arxiv.org/abs/2403.04272v1,,2403.04272v1.pdf,Active Generalized Category Discovery,"Generalized Category Discovery (GCD) is a pragmatic and challenging +open-world task, which endeavors to cluster unlabeled samples from both novel +and old classes, leveraging some labeled data of old classes. Given that +knowledge learned from old classes is not fully transferable to new classes, +and that novel categories are fully unlabeled, GCD inherently faces intractable +problems, including imbalanced classification performance and inconsistent +confidence between old and new classes, especially in the low-labeling regime. +Hence, some annotations of new classes are deemed necessary. However, labeling +new classes is extremely costly. To address this issue, we take the spirit of +active learning and propose a new setting called Active Generalized Category +Discovery (AGCD). The goal is to improve the performance of GCD by actively +selecting a limited amount of valuable samples for labeling from the oracle. To +solve this problem, we devise an adaptive sampling strategy, which jointly +considers novelty, informativeness and diversity to adaptively select novel +samples with proper uncertainty. However, owing to the varied orderings of +label indices caused by the clustering of novel classes, the queried labels are +not directly applicable to subsequent training. To overcome this issue, we +further propose a stable label mapping algorithm that transforms ground truth +labels to the label space of the classifier, thereby ensuring consistent +training across different active selection stages. Our method achieves +state-of-the-art performance on both generic and fine-grained datasets. Our +code is available at https://github.com/mashijie1028/ActiveGCD",cs.CV,['cs.CV'] +PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation,Zhenyu Li · Shariq Bhat · Peter Wonka, ,https://arxiv.org/abs/2312.02284,,2312.02284.pdf,PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation,"Single image depth estimation is a foundational task in computer vision and +generative modeling. However, prevailing depth estimation models grapple with +accommodating the increasing resolutions commonplace in today's consumer +cameras and devices. Existing high-resolution strategies show promise, but they +often face limitations, ranging from error propagation to the loss of +high-frequency details. We present PatchFusion, a novel tile-based framework +with three key components to improve the current state of the art: (1) A +patch-wise fusion network that fuses a globally-consistent coarse prediction +with finer, inconsistent tiled predictions via high-level feature guidance, (2) +A Global-to-Local (G2L) module that adds vital context to the fusion network, +discarding the need for patch selection heuristics, and (3) A Consistency-Aware +Training (CAT) and Inference (CAI) approach, emphasizing patch overlap +consistency and thereby eradicating the necessity for post-processing. 
+Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that +our framework can generate high-resolution depth maps with intricate details. +PatchFusion is independent of the base model for depth estimation. Notably, our +framework built on top of SOTA ZoeDepth brings improvements for a total of +17.3% and 29.4% in terms of the root mean squared error (RMSE) on +UnrealStereo4K and MVS-Synth, respectively.",cs.CV,['cs.CV'] +Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,Xianghui Xie · Bharat Lal Bhatnagar · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2312.07063,,2312.07063.pdf,Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,"Reconstructing human-object interaction in 3D from a single RGB image is a +challenging task and existing data driven methods do not generalize beyond the +objects present in the carefully curated 3D interaction datasets. Capturing +large-scale real data to learn strong interaction and 3D shape priors is very +expensive due to the combinatorial nature of human-object interactions. In this +paper, we propose ProciGen (Procedural interaction Generation), a method to +procedurally generate datasets with both, plausible interaction and diverse +object variation. We generate 1M+ human-object interaction pairs in 3D and +leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), +a novel method to reconstruct interacting human and unseen objects, without any +templates. Our HDM is an image-conditioned diffusion model that learns both +realistic interaction and highly accurate human and object shapes. Experiments +show that our HDM trained with ProciGen significantly outperforms prior methods +that requires template meshes and that our dataset allows training methods with +strong generalization ability to unseen object instances. Our code and data are +released.",cs.CV,['cs.CV'] +Transferable and Principled Efficiency for Open-Vocabulary Segmentation,Jingxuan Xu · Wuyang Chen · Yao Zhao · Yunchao Wei, ,https://arxiv.org/abs/2404.07448,,2404.07448.pdf,Transferable and Principled Efficiency for Open-Vocabulary Segmentation,"Recent success of pre-trained foundation vision-language models makes +Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance, +this approach introduces heavy computational overheads for two challenges: 1) +large model sizes of the backbone; 2) expensive costs during the fine-tuning. +These challenges hinder this OVS strategy from being widely applicable and +affordable in real-world scenarios. Although traditional methods such as model +compression and efficient fine-tuning can address these challenges, they often +rely on heuristics. This means that their solutions cannot be easily +transferred and necessitate re-training on different models, which comes at a +cost. In the context of efficient OVS, we target achieving performance that is +comparable to or even better than prior OVS works based on large +vision-language foundation models, by utilizing smaller models that incur lower +training costs. The core strategy is to make our efficiency principled and thus +seamlessly transferable from one OVS framework to others without further +customization. Comprehensive experiments on diverse OVS benchmarks demonstrate +our superior trade-off between segmentation accuracy and computation costs over +previous works. 
Our code is available on https://github.com/Xujxyang/OpenTrans",cs.CV,"['cs.CV', 'cs.CL', 'eess.IV']" +Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning,Haoyu Chen · Wenbo Li · Jinjin Gu · Jingjing Ren · Haoze Sun · Xueyi Zou · Youliang Yan · Zhensong Zhang · Lei Zhu, ,https://arxiv.org/abs/2403.02601,,2403.02601.pdf,Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning,"For image super-resolution (SR), bridging the gap between the performance on +synthetic datasets and real-world degradation scenarios remains a challenge. +This work introduces a novel ""Low-Res Leads the Way"" (LWay) training framework, +merging Supervised Pre-training with Self-supervised Learning to enhance the +adaptability of SR models to real-world images. Our approach utilizes a +low-resolution (LR) reconstruction network to extract degradation embeddings +from LR images, merging them with super-resolved outputs for LR reconstruction. +Leveraging unseen LR images for self-supervised learning guides the model to +adapt its modeling space to the target domain, facilitating fine-tuning of SR +models without requiring paired high-resolution (HR) images. The integration of +Discrete Wavelet Transform (DWT) further refines the focus on high-frequency +details. Extensive evaluations show that our method significantly improves the +generalization and detail restoration capabilities of SR models on unseen +real-world datasets, outperforming existing methods. Our training regime is +universally compatible, requiring no network architecture modifications, making +it a practical solution for real-world SR applications.",eess.IV,"['eess.IV', 'cs.CV']" +BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed,Abhishek Tandon · Anujraaj Goyal · Henry M. Clever · Zackory Erickson, ,https://arxiv.org/abs/2404.03183,,2404.03183.pdf,BodyMAP -- Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed,"Accurately predicting the 3D human posture and the pressure exerted on the +body for people resting in bed, visualized as a body mesh (3D pose & shape) +with a 3D pressure map, holds significant promise for healthcare applications, +particularly, in the prevention of pressure ulcers. Current methods focus on +singular facets of the problem -- predicting only 2D/3D poses, generating 2D +pressure images, predicting pressure only for certain body regions instead of +the full body, or forming indirect approximations to the 3D pressure map. In +contrast, we introduce BodyMAP, which jointly predicts the human body mesh and +3D applied pressure map across the entire human body. Our network leverages +multiple visual modalities, incorporating both a depth image of a person in bed +and its corresponding 2D pressure image acquired from a pressure-sensing +mattress. The 3D pressure map is represented as a pressure value at each mesh +vertex and thus allows for precise localization of high-pressure regions on the +body. Additionally, we present BodyMAP-WS, a new formulation of pressure +prediction in which we implicitly learn pressure in 3D by aligning sensed 2D +pressure images with a differentiable 2D projection of the predicted 3D +pressure maps. 
In evaluations with real-world human data, our method +outperforms the current state-of-the-art technique by 25% on both body mesh and +3D applied pressure map prediction tasks for people in bed.",cs.CV,['cs.CV'] +OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising,Haichao Zhang · Yi Xu · Hongsheng Lu · Takayuki Shimizu · Yun Fu, ,https://arxiv.org/abs/2404.02227,,2404.02227.pdf,OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising,"Trajectory prediction is fundamental in computer vision and autonomous +driving, particularly for understanding pedestrian behavior and enabling +proactive decision-making. Existing approaches in this field often assume +precise and complete observational data, neglecting the challenges associated +with out-of-view objects and the noise inherent in sensor data due to limited +camera range, physical obstructions, and the absence of ground truth for +denoised sensor data. Such oversights are critical safety concerns, as they can +result in missing essential, non-visible objects. To bridge this gap, we +present a novel method for out-of-sight trajectory prediction that leverages a +vision-positioning technique. Our approach denoises noisy sensor observations +in an unsupervised manner and precisely maps sensor-based trajectories of +out-of-sight objects into visual trajectories. This method has demonstrated +state-of-the-art performance in out-of-sight noisy sensor trajectory denoising +and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory +prediction accuracy and addressing the challenges of out-of-sight objects, our +work significantly contributes to improving the safety and reliability of +autonomous driving in complex environments. Our work represents the first +initiative towards Out-Of-Sight Trajectory prediction (OOSTraj), setting a new +benchmark for future research. The code is available at +\url{https://github.com/Hai-chao-Zhang/OOSTraj}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +PAD: Patch-Agnostic Defense against Adversarial Patch Attacks,Lihua Jing · Rui Wang · Wenqi Ren · Xin Dong · Cong Zou, ,https://arxiv.org/abs/2404.16452,,2404.16452.pdf,PAD: Patch-Agnostic Defense against Adversarial Patch Attacks,"Adversarial patch attacks present a significant threat to real-world object +detectors due to their practical feasibility. Existing defense methods, which +rely on attack data or prior knowledge, struggle to effectively address a wide +range of adversarial patches. In this paper, we show two inherent +characteristics of adversarial patches, semantic independence and spatial +heterogeneity, independent of their appearance, shape, size, quantity, and +location. Semantic independence indicates that adversarial patches operate +autonomously within their semantic context, while spatial heterogeneity +manifests as distinct image quality of the patch area that differs from +original clean image due to the independent generation process. Based on these +observations, we propose PAD, a novel adversarial patch localization and +removal method that does not require prior knowledge or additional training. +PAD offers patch-agnostic defense against various adversarial patches, +compatible with any pre-trained object detectors. Our comprehensive digital and +physical experiments involving diverse patch types, such as localized noise, +printable, and naturalistic patches, exhibit notable improvements over +state-of-the-art works. 
Our code is available at +https://github.com/Lihua-Jing/PAD.",cs.CV,['cs.CV'] +ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images,Jinseo Jeong · Junseo Koo · Qimeng Zhang · Gunhee Kim, ,https://arxiv.org/abs/2404.15707,,2404.15707.pdf,ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images,"Existing NeRF-based inverse rendering methods suppose that scenes are +exclusively illuminated by distant light sources, neglecting the potential +influence of emissive sources within a scene. In this work, we confront this +limitation using LDR multi-view images captured with emissive sources turned on +and off. Two key issues must be addressed: 1) ambiguity arising from the +limited dynamic range along with unknown lighting details, and 2) the expensive +computational cost in volume rendering to backtrace the paths leading to final +object colors. We present a novel approach, ESR-NeRF, leveraging neural +networks as learnable functions to represent ray-traced fields. By training +networks to satisfy light transport segments, we regulate outgoing radiances, +progressively identifying emissive sources while being aware of reflection +areas. The results on scenes encompassing emissive sources with various +properties demonstrate the superiority of ESR-NeRF in qualitative and +quantitative ways. Our approach also extends its applicability to the scenes +devoid of emissive sources, achieving lower CD metrics on the DTU dataset.",cs.CV,['cs.CV'] +Enhancing Video Super-Resolution via Implicit Resampling-based Alignment,Kai Xu · Ziwei Yu · Xin Wang · Michael Bi Mi · Angela Yao,https://github.com/kai422/IART,https://arxiv.org/html/2305.00163v2,,2305.00163v2.pdf,Enhancing Video Super-Resolution via Implicit Resampling-based Alignment,"In video super-resolution, it is common to use a frame-wise alignment to +support the propagation of information over time. The role of alignment is +well-studied for low-level enhancement in video, but existing works overlook a +critical step -- resampling. We show through extensive experiments that for +alignment to be effective, the resampling should preserve the reference +frequency spectrum while minimizing spatial distortions. However, most existing +works simply use a default choice of bilinear interpolation for resampling even +though bilinear interpolation has a smoothing effect and hinders +super-resolution. From these observations, we propose an implicit +resampling-based alignment. The sampling positions are encoded by a sinusoidal +positional encoding, while the value is estimated with a coordinate network and +a window-based cross-attention. We show that bilinear interpolation inherently +attenuates high-frequency information while an MLP-based coordinate network can +approximate more frequencies. Experiments on synthetic and real-world datasets +show that alignment with our proposed implicit resampling enhances the +performance of state-of-the-art frameworks with minimal impact on both compute +and parameters.",cs.CV,['cs.CV'] +UniPTS: A Unified Framework for Proficient Post-Training Sparsity,JingJing Xie · Yuxin Zhang · Mingbao Lin · ZhiHang Lin · Liujuan Cao · Rongrong Ji, ,https://arxiv.org/abs/2405.18810,,2405.18810.pdf,UniPTS: A Unified Framework for Proficient Post-Training Sparsity,"Post-training Sparsity (PTS) is a recently emerged avenue that chases +efficient network sparsity with limited data in need. 
Existing PTS methods, +however, undergo significant performance degradation compared with traditional +methods that retrain the sparse networks via the whole dataset, especially at +high sparsity ratios. In this paper, we attempt to reconcile this disparity by +transposing three cardinal factors that profoundly alter the performance of +conventional sparsity into the context of PTS. Our endeavors particularly +comprise (1) A base-decayed sparsity objective that promotes efficient +knowledge transferring from dense network to the sparse counterpart. (2) A +reducing-regrowing search algorithm designed to ascertain the optimal sparsity +distribution while circumventing overfitting to the small calibration set in +PTS. (3) The employment of dynamic sparse training predicated on the preceding +aspects, aimed at comprehensively optimizing the sparsity structure while +ensuring training stability. Our proposed framework, termed UniPTS, is +validated to be much superior to existing PTS methods across extensive +benchmarks. As an illustration, it amplifies the performance of POT, a recently +proposed recipe, from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity +ratio on ImageNet. We release the code of our paper at +https://github.com/xjjxmu/UniPTS.",cs.CV,"['cs.CV', 'cs.AI']" +MMA-Diffusion: MultiModal Attack on Diffusion Models,Yijun Yang · Ruiyuan Gao · Xiaosen Wang · Tsung-Yi Ho · Xu Nan · Qiang Xu,https://github.com/cure-lab/MMA-Diffusion,https://arxiv.org/abs/2311.17516,,2311.17516.pdf,MMA-Diffusion: MultiModal Attack on Diffusion Models,"In recent years, Text-to-Image (T2I) models have seen remarkable +advancements, gaining widespread adoption. However, this progress has +inadvertently opened avenues for potential misuse, particularly in generating +inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces +MMA-Diffusion, a framework that presents a significant and realistic threat to +the security of T2I models by effectively circumventing current defensive +measures in both open-source models and commercial online services. Unlike +previous approaches, MMA-Diffusion leverages both textual and visual modalities +to bypass safeguards like prompt filters and post-hoc safety checkers, thus +exposing and highlighting the vulnerabilities in existing defense mechanisms.",cs.CR,"['cs.CR', 'cs.CV']" +Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures,Huijie Zhang · Yifu Lu · Ismail Alkhouri · Saiprasad Ravishankar · Dogyoon Song · Qing Qu, ,https://arxiv.org/abs/2312.09181,,2312.09181.pdf,Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures,"Diffusion models, emerging as powerful deep generative tools, excel in +various applications. They operate through a two-steps process: introducing +noise into training samples and then employing a model to convert random noise +into new samples (e.g., images). However, their remarkable generative +performance is hindered by slow training and sampling. This is due to the +necessity of tracking extensive forward and reverse diffusion trajectories, and +employing a large model with numerous parameters across multiple timesteps +(i.e., noise levels). To tackle these challenges, we present a multi-stage +framework inspired by our empirical findings. These observations indicate the +advantages of employing distinct parameters tailored to each timestep while +retaining universal parameters shared across all time steps. 
Our approach +involves segmenting the time interval into multiple stages where we employ +custom multi-decoder U-net architecture that blends time-dependent models with +a universally shared encoder. Our framework enables the efficient distribution +of computational resources and mitigates inter-stage interference, which +substantially improves training efficiency. Extensive numerical experiments +affirm the effectiveness of our framework, showcasing significant training and +sampling efficiency enhancements on three state-of-the-art diffusion models, +including large-scale latent diffusion models. Furthermore, our ablation +studies illustrate the impact of two important components in our framework: (i) +a novel timestep clustering algorithm for stage division, and (ii) an +innovative multi-decoder U-net architecture, seamlessly integrating universal +and customized hyperparameters.",cs.CV,['cs.CV'] +DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,Chong Mou · Xintao Wang · Jiechong Song · Ying Shan · Jian Zhang, ,https://arxiv.org/abs/2402.02583,,2402.02583.pdf,DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,"Large-scale Text-to-Image (T2I) diffusion models have revolutionized image +generation over the last few years. Although owning diverse and high-quality +generation capabilities, translating these abilities to fine-grained image +editing remains challenging. In this paper, we propose DiffEditor to rectify +two weaknesses in existing diffusion-based image editing: (1) in complex +scenarios, editing results often lack editing accuracy and exhibit unexpected +artifacts; (2) lack of flexibility to harmonize editing operations, e.g., +imagine new content. In our solution, we introduce image prompts in +fine-grained image editing, cooperating with the text prompt to better describe +the editing content. To increase the flexibility while maintaining content +consistency, we locally combine stochastic differential equation (SDE) into the +ordinary differential equation (ODE) sampling. In addition, we incorporate +regional score-based gradient guidance and a time travel strategy into the +diffusion sampling, further improving the editing quality. Extensive +experiments demonstrate that our method can efficiently achieve +state-of-the-art performance on various fine-grained image editing tasks, +including editing within a single image (e.g., object moving, resizing, and +content dragging) and across images (e.g., appearance replacing and object +pasting). Our source code is released at +https://github.com/MC-E/DragonDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +DiVAS: Video and Audio Synchronization with Dynamic Frame Rates,Clara Maria Fernandez Labrador · Mertcan Akcay · Eitan Abecassis · Joan Massich · Christopher Schroers, ,,https://link.springer.com/article/10.1007/s11042-023-17728-1,,,,,nan +EvDiG: Event-guided Direct and Global Components Separation,xinyu zhou · Peiqi Duan · Boyu Li · Chu Zhou · Chao Xu · Boxin Shi, ,http://export.arxiv.org/abs/2312.16933v1,,2312.16933v1.pdf,EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion,"Event cameras and RGB cameras exhibit complementary characteristics in +imaging: the former possesses high dynamic range (HDR) and high temporal +resolution, while the latter provides rich texture and color information. This +makes the integration of event cameras into middle- and high-level RGB-based +vision tasks highly promising. 
However, challenges arise in multi-modal fusion, +data annotation, and model architecture design. In this paper, we propose +EvPlug, which learns a plug-and-play event and image fusion module from the +supervision of the existing RGB-based model. The learned fusion module +integrates event streams with image features in the form of a plug-in, endowing +the RGB-based model to be robust to HDR and fast motion scenes while enabling +high temporal resolution inference. Our method only requires unlabeled +event-image pairs (no pixel-wise alignment required) and does not alter the +structure or weights of the RGB-based model. We demonstrate the superiority of +EvPlug in several vision tasks such as object detection, semantic segmentation, +and 3D hand pose estimation",cs.CV,"['cs.CV', 'cs.AI']" +Face2Diffusion for Fast and Editable Face Personalization,Kaede Shiohara · Toshihiko Yamasaki,https://mapooon.github.io/Face2DiffusionPage/,https://arxiv.org/abs/2403.05094,,2403.05094.pdf,Face2Diffusion for Fast and Editable Face Personalization,"Face personalization aims to insert specific faces, taken from images, into +pretrained text-to-image diffusion models. However, it is still challenging for +previous methods to preserve both the identity similarity and editability due +to overfitting to training samples. In this paper, we propose Face2Diffusion +(F2D) for high-editability face personalization. The core idea behind F2D is +that removing identity-irrelevant information from the training pipeline +prevents the overfitting problem and improves editability of encoded faces. F2D +consists of the following three novel components: 1) Multi-scale identity +encoder provides well-disentangled identity features while keeping the benefits +of multi-scale information, which improves the diversity of camera poses. 2) +Expression guidance disentangles face expressions from identities and improves +the controllability of face expressions. 3) Class-guided denoising +regularization encourages models to learn how faces should be denoised, which +boosts the text-alignment of backgrounds. Extensive experiments on the +FaceForensics++ dataset and diverse prompts demonstrate our method greatly +improves the trade-off between the identity- and text-fidelity compared to +previous state-of-the-art methods.",cs.CV,['cs.CV'] +PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition,Haosong Zhang · Mei Leong · Liyuan Li · Weisi Lin, ,https://ar5iv.labs.arxiv.org/html/2205.11169,,2205.11169.pdf,PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models,"Vision-language pre-training (VLP) has shown impressive performance on a wide +range of cross-modal tasks, where VLP models without reliance on object +detectors are becoming the mainstream due to their superior computation +efficiency and competitive performance. However, the removal of object +detectors also deprives the capability of VLP models in explicit object +modeling, which is essential to various position-sensitive vision-language (VL) +tasks, such as referring expression comprehension and visual commonsense +reasoning. To address the challenge, we introduce PEVL that enhances the +pre-training and prompt tuning of VLP models with explicit object position +modeling. Specifically, PEVL reformulates discretized object positions and +language in a unified language modeling framework, which facilitates explicit +VL alignment during pre-training, and also enables flexible prompt tuning for +various downstream tasks. 
We show that PEVL enables state-of-the-art +performance of detector-free VLP models on position-sensitive tasks such as +referring expression comprehension and phrase grounding, and also improves the +performance on position-insensitive tasks with grounded inputs. We make the +data and code for this paper publicly available at +https://github.com/thunlp/PEVL.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +HOI-M$^3$: Capture Multiple Humans and Objects Interaction within Contextual Environment,Juze Zhang · Jingyan Zhang · Zining Song · Zhanhe Shi · Chengfeng Zhao · Ye Shi · Jingyi Yu · Lan Xu · Jingya Wang, ,https://arxiv.org/abs/2404.00299,,2404.00299.pdf,HOI-M3:Capture Multiple Humans and Objects Interaction within Contextual Environment,"Humans naturally interact with both others and the surrounding multiple +objects, engaging in various social activities. However, recent advances in +modeling human-object interactions mostly focus on perceiving isolated +individuals and objects, due to fundamental data scarcity. In this paper, we +introduce HOI-M3, a novel large-scale dataset for modeling the interactions of +Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking +for both humans and objects from dense RGB and object-mounted IMU inputs, +covering 199 sequences and 181M frames of diverse humans and objects under rich +activities. With the unique HOI-M3 dataset, we introduce two novel data-driven +tasks with companion strong baselines: monocular capture and unstructured +generation of multiple human-object interactions. Extensive experiments +demonstrate that our dataset is challenging and worthy of further research +about multiple human-object interactions and behavior analysis. Our HOI-M3 +dataset, corresponding codes, and pre-trained models will be disseminated to +the community for future research.",cs.CV,['cs.CV'] +GraCo: Granularity-Controllable Interactive Segmentation,Yian Zhao · Kehan Li · Zesen Cheng · Pengchong Qiao · Xiawu Zheng · Rongrong Ji · Chang Liu · Li Yuan · Jie Chen, ,https://arxiv.org/abs/2405.00587,,2405.00587.pdf,GraCo: Granularity-Controllable Interactive Segmentation,"Interactive Segmentation (IS) segments specific objects or parts in the image +according to user input. Current IS pipelines fall into two categories: +single-granularity output and multi-granularity output. The latter aims to +alleviate the spatial ambiguity present in the former. However, the +multi-granularity output pipeline suffers from limited interaction flexibility +and produces redundant results. In this work, we introduce +Granularity-Controllable Interactive Segmentation (GraCo), a novel approach +that allows precise control of prediction granularity by introducing additional +parameters to input. This enhances the customization of the interactive system +and eliminates redundancy while resolving ambiguity. Nevertheless, the +exorbitant cost of annotating multi-granularity masks and the lack of available +datasets with granularity annotations make it difficult for models to acquire +the necessary guidance to control output granularity. To address this problem, +we design an any-granularity mask generator that exploits the semantic property +of the pre-trained IS model to automatically generate abundant mask-granularity +pairs without requiring additional manual annotation. Based on these pairs, we +propose a granularity-controllable learning strategy that efficiently imparts +the granularity controllability to the IS model. 
Extensive experiments on +intricate scenarios at object and part levels demonstrate that our GraCo has +significant advantages over previous methods. This highlights the potential of +GraCo to be a flexible annotation tool, capable of adapting to diverse +segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo.",cs.CV,['cs.CV'] +Deep Equilibrium Diffusion Restoration with Parallel Sampling,Jiezhang Cao · Yue Shi · Kai Zhang · Yulun Zhang · Radu Timofte · Luc Van Gool, ,https://arxiv.org/abs/2311.11600,,2311.11600.pdf,Deep Equilibrium Diffusion Restoration with Parallel Sampling,"Diffusion model-based image restoration (IR) aims to use diffusion models to +recover high-quality (HQ) images from degraded images, achieving promising +performance. Due to the inherent property of diffusion models, most existing +methods need long serial sampling chains to restore HQ images step-by-step, +resulting in expensive sampling time and high computation costs. Moreover, such +long sampling chains hinder understanding the relationship between inputs and +restoration results since it is hard to compute the gradients in the whole +chains. In this work, we aim to rethink the diffusion model-based IR models +through a different perspective, i.e., a deep equilibrium (DEQ) fixed point +system, called DeqIR. Specifically, we derive an analytical solution by +modeling the entire sampling chain in these IR models as a joint multivariate +fixed point system. Based on the analytical solution, we can conduct parallel +sampling and restore HQ images without training. Furthermore, we compute fast +gradients via DEQ inversion and found that initialization optimization can +boost image quality and control the generation direction. Extensive experiments +on benchmarks demonstrate the effectiveness of our method on typical IR tasks +and real-world settings.",cs.CV,['cs.CV'] +Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network,Sizhe Zheng · Pan Gao · Peng Zhou · Jie Qin, ,https://arxiv.org/abs/2405.19775,,2405.19775.pdf,Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network,"Style transfer aims to render an image with the artistic features of a style +image, while maintaining the original structure. Various methods have been put +forward for this task, but some challenges still exist. For instance, it is +difficult for CNN-based methods to handle global information and long-range +dependencies between input images, for which transformer-based methods have +been proposed. Although transformers can better model the relationship between +content and style images, they require high-cost hardware and time-consuming +inference. To address these issues, we design a novel transformer model that +includes only the encoder, thus significantly reducing the computational cost. +In addition, we also find that existing style transfer methods may lead to +images under-stylied or missing content. In order to achieve better +stylization, we design a content feature extractor and a style feature +extractor, based on which pure content and style images can be fed to the +transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure +content and style feature fusion network. 
Through qualitative and quantitative +experiments, we demonstrate the advantages of our model compared to +state-of-the-art ones in the literature.",cs.CV,['cs.CV'] +Tactile-Augmented Radiance Fields,Yiming Dou · Fengyu Yang · Yi Liu · Antonio Loquercio · Andrew Owens, ,https://arxiv.org/abs/2405.04534,,2405.04534.pdf,Tactile-Augmented Radiance Fields,"We present a scene representation, which we call a tactile-augmented radiance +field (TaRF), that brings vision and touch into a shared 3D space. This +representation can be used to estimate the visual and tactile signals for a +given 3D position within a scene. We capture a scene's TaRF from a collection +of photos and sparsely sampled touch probes. Our approach makes use of two +insights: (i) common vision-based touch sensors are built on ordinary cameras +and thus can be registered to images using methods from multi-view geometry, +and (ii) visually and structurally similar regions of a scene share the same +tactile features. We use these insights to register touch signals to a captured +visual scene, and to train a conditional diffusion model that, provided with an +RGB-D image rendered from a neural radiance field, generates its corresponding +tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This +dataset contains more touch samples than previous real-world datasets, and it +provides spatially aligned visual signals for each captured touch signal. We +demonstrate the accuracy of our cross-modal generative model and the utility of +the captured visual-tactile data on several downstream tasks. Project page: +https://dou-yiming.github.io/TaRF",cs.CV,['cs.CV'] +The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes,Myeongseob Ko · Feiyang Kang · Weiyan Shi · Ming Jin · Zhou Yu · Ruoxi Jia, ,https://arxiv.org/abs/2402.08922,,2402.08922.pdf,The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes,"Large-scale black-box models have become ubiquitous across numerous +applications. Understanding the influence of individual training data sources +on predictions made by these models is crucial for improving their +trustworthiness. Current influence estimation techniques involve computing +gradients for every training point or repeated training on different subsets. +These approaches face obvious computational challenges when scaled up to large +datasets and models. + In this paper, we introduce and explore the Mirrored Influence Hypothesis, +highlighting a reciprocal nature of influence between training and test data. +Specifically, it suggests that evaluating the influence of training data on +test predictions can be reformulated as an equivalent, yet inverse problem: +assessing how the predictions for training samples would be altered if the +model were trained on specific test samples. Through both empirical and +theoretical validations, we demonstrate the wide applicability of our +hypothesis. Inspired by this, we introduce a new method for estimating the +influence of training data, which requires calculating gradients for specific +test samples, paired with a forward pass for each training point. This approach +can capitalize on the common asymmetry in scenarios where the number of test +samples under concurrent examination is much smaller than the scale of the +training dataset, thus gaining a significant improvement in efficiency compared +to existing approaches. 
+ We demonstrate the applicability of our method across a range of scenarios, +including data attribution in diffusion models, data leakage detection, +analysis of memorization, mislabeled data detection, and tracing behavior in +language models. Our code will be made available at +https://github.com/ruoxi-jia-group/Forward-INF.",cs.LG,"['cs.LG', 'stat.ML']" +Logit Standardization in Knowledge Distillation,Shangquan Sun · Wenqi Ren · Jingzhi Li · Rui Wang · Xiaochun Cao,https://sunsean21.github.io/logit-stand-KD.html,https://arxiv.org/abs/2403.01427,,2403.01427.pdf,Logit Standardization in Knowledge Distillation,"Knowledge distillation involves transferring soft labels from a teacher to a +student using a shared temperature-based softmax function. However, the +assumption of a shared temperature between teacher and student implies a +mandatory exact match between their logits in terms of logit range and +variance. This side-effect limits the performance of student, considering the +capacity discrepancy between them and the finding that the innate logit +relations of teacher are sufficient for student to learn. To address this +issue, we propose setting the temperature as the weighted standard deviation of +logit and performing a plug-and-play Z-score pre-process of logit +standardization before applying softmax and Kullback-Leibler divergence. Our +pre-process enables student to focus on essential logit relations from teacher +rather than requiring a magnitude match, and can improve the performance of +existing logit-based distillation methods. We also show a typical case where +the conventional setting of sharing temperature between teacher and student +cannot reliably yield the authentic distillation evaluation; nonetheless, this +challenge is successfully alleviated by our Z-score. We extensively evaluate +our method for various student and teacher models on CIFAR-100 and ImageNet, +showing its significant superiority. The vanilla knowledge distillation powered +by our pre-process can achieve favorable performance against state-of-the-art +methods, and other distillation variants can obtain considerable gain with the +assistance of our pre-process.",cs.CV,['cs.CV'] +Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring,Xiaoqian Lv · Shengping Zhang · Chenyang Wang · Yichen Zheng · Bineng Zhong · Chongyi Li · Liqiang Nie, ,,https://www.sciencedirect.com/science/article/abs/pii/S0957417424005888,,,,,nan +Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance,Dazhong Shen · Guanglu Song · Zeyue Xue · Fu-Yun Wang · Yu Liu, ,https://arxiv.org/abs/2404.05384,,2404.05384.pdf,Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance,"Classifier-Free Guidance (CFG) has been widely used in text-to-image +diffusion models, where the CFG scale is introduced to control the strength of +text guidance on the whole image space. However, we argue that a global CFG +scale results in spatial inconsistency on varying semantic strengths and +suboptimal image quality. To address this problem, we present a novel approach, +Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance +degrees for different semantic units in text-to-image diffusion models. +Specifically, we first design a training-free semantic segmentation method to +partition the latent image into relatively independent semantic regions at each +denoising step. 
In particular, the cross-attention map in the denoising U-net +backbone is renormalized for assigning each patch to the corresponding token, +while the self-attention map is used to complete the semantic regions. Then, to +balance the amplification of diverse semantic units, we adaptively adjust the +CFG scales across different semantic regions to rescale the text guidance +degrees into a uniform level. Finally, extensive experiments demonstrate the +superiority of S-CFG over the original CFG strategy on various text-to-image +diffusion models, without requiring any extra training cost. our codes are +available at https://github.com/SmilesDZgk/S-CFG.",cs.CV,"['cs.CV', 'cs.AI']" +Scalable 3D Registration via Truncated Entry-wise Absolute Residuals,Tianyu Huang · Liangzu Peng · Rene Vidal · Yun-Hui Liu, ,https://arxiv.org/abs/2404.00915,,2404.00915.pdf,Scalable 3D Registration via Truncated Entry-wise Absolute Residuals,"Given an input set of $3$D point pairs, the goal of outlier-robust $3$D +registration is to compute some rotation and translation that align as many +point pairs as possible. This is an important problem in computer vision, for +which many highly accurate approaches have been recently proposed. Despite +their impressive performance, these approaches lack scalability, often +overflowing the $16$GB of memory of a standard laptop to handle roughly +$30,000$ point pairs. In this paper, we propose a $3$D registration approach +that can process more than ten million ($10^7$) point pairs with over $99\%$ +random outliers. Moreover, our method is efficient, entails low memory costs, +and maintains high accuracy at the same time. We call our method TEAR, as it +involves minimizing an outlier-robust loss that computes Truncated Entry-wise +Absolute Residuals. To minimize this loss, we decompose the original +$6$-dimensional problem into two subproblems of dimensions $3$ and $2$, +respectively, solved in succession to global optimality via a customized +branch-and-bound method. While branch-and-bound is often slow and unscalable, +this does not apply to TEAR as we propose novel bounding functions that are +tight and computationally efficient. Experiments on various datasets are +conducted to validate the scalability and efficiency of our method.",cs.CV,"['cs.CV', 'cs.RO']" +Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching,Matteo Bastico · Etienne Decencière · Laurent Corté · Yannick TILLIER · David Ryckelynck,https://github.com/matteo-bastico/CoupLap,,https://paperswithcode.com/paper/coupled-laplacian-eigenmaps-for-locally-aware,,,,,nan +MatFuse: Controllable Material Generation with Diffusion Models,Giuseppe Vecchio · Renato Sortino · Simone Palazzo · Concetto Spampinato,https://gvecchio.com/matfuse/,https://arxiv.org/abs/2308.11408,,2308.11408.pdf,MatFuse: Controllable Material Generation with Diffusion Models,"Creating high-quality materials in computer graphics is a challenging and +time-consuming task, which requires great expertise. To simplify this process, +we introduce MatFuse, a unified approach that harnesses the generative power of +diffusion models for creation and editing of 3D materials. Our method +integrates multiple sources of conditioning, including color palettes, +sketches, text, and pictures, enhancing creative possibilities and granting +fine-grained control over material synthesis. 
Additionally, MatFuse enables +map-level material editing capabilities through latent manipulation by means of +a multi-encoder compression model which learns a disentangled latent +representation for each map. We demonstrate the effectiveness of MatFuse under +multiple conditioning settings and explore the potential of material editing. +Finally, we assess the quality of the generated materials both quantitatively +in terms of CLIP-IQA and FID scores and qualitatively by conducting a user +study. Source code for training MatFuse and supplemental materials are publicly +available at https://gvecchio.com/matfuse.",cs.CV,"['cs.CV', 'cs.GR']" +Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World,Huiyuan Fu · Fei Peng · Xianwei Li · Yejun Li · Xin Wang · Huadong Ma, ,,https://github.com/Weepingchestnut/Arbitrary-Scale-SR,,,,,nan +DETRs Beat YOLOs on Real-time Object Detection,Yian Zhao · Wenyu Lv · Shangliang Xu · Jinman Wei · Guanzhong Wang · Qingqing Dang · Yi Liu · Jie Chen, ,https://arxiv.org/html/2304.08069v3,,2304.08069v3.pdf,DETRs Beat YOLOs on Real-time Object Detection,"The YOLO series has become the most popular framework for real-time object +detection due to its reasonable trade-off between speed and accuracy. However, +we observe that the speed and accuracy of YOLOs are negatively affected by the +NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an +alternative to eliminating NMS. Nevertheless, the high computational cost +limits their practicality and hinders them from fully exploiting the advantage +of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer +(RT-DETR), the first real-time end-to-end object detector to our best knowledge +that addresses the above dilemma. We build RT-DETR in two steps, drawing on the +advanced DETR: first we focus on maintaining accuracy while improving speed, +followed by maintaining speed while improving accuracy. Specifically, we design +an efficient hybrid encoder to expeditiously process multi-scale features by +decoupling intra-scale interaction and cross-scale fusion to improve speed. +Then, we propose the uncertainty-minimal query selection to provide +high-quality initial queries to the decoder, thereby improving accuracy. In +addition, RT-DETR supports flexible speed tuning by adjusting the number of +decoder layers to adapt to various scenarios without retraining. Our +RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 +GPU, outperforming previously advanced YOLOs in both speed and accuracy. We +also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and +M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy +and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / +R101 achieves 55.3% / 56.2% AP. The project page: +https://zhao-yian.github.io/RTDETR.",cs.CV,['cs.CV'] +Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling,Kranthi Kumar Rachavarapu · Kalyan Ramakrishnan · A. N. Rajagopalan, ,https://arxiv.org/abs/2405.10690,,2405.10690.pdf,CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing,"Weakly supervised audio-visual video parsing (AVVP) methods aim to detect +audible-only, visible-only, and audible-visible events using only video-level +labels. Existing approaches tackle this by leveraging unimodal and cross-modal +contexts. 
However, we argue that while cross-modal learning is beneficial for +detecting audible-visible events, in the weakly supervised scenario, it +negatively impacts unaligned audible or visible events by introducing +irrelevant modality information. In this paper, we propose CoLeaF, a novel +learning framework that optimizes the integration of cross-modal context in the +embedding space such that the network explicitly learns to combine cross-modal +information for audible-visible events while filtering them out for unaligned +events. Additionally, as videos often involve complex class relationships, +modelling them improves performance. However, this introduces extra +computational costs into the network. Our framework is designed to leverage +cross-class relationships during training without incurring additional +computations at inference. Furthermore, we propose new metrics to better +evaluate a method's capabilities in performing AVVP. Our extensive experiments +demonstrate that CoLeaF significantly improves the state-of-the-art results by +an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, +respectively.",cs.CV,['cs.CV'] +Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning,Rui Zhao · Bin Shi · Jianfei Ruan · Tianze Pan · Bo Dong,https://github.com/RyanZhaoIc/PLM.git,https://arxiv.org/abs/2405.05714,,2405.05714.pdf,Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning,"In noisy label learning, estimating noisy class posteriors plays a +fundamental role for developing consistent classifiers, as it forms the basis +for estimating clean class posteriors and the transition matrix. Existing +methods typically learn noisy class posteriors by training a classification +model with noisy labels. However, when labels are incorrect, these models may +be misled to overemphasize the feature parts that do not reflect the instance +characteristics, resulting in significant errors in estimating noisy class +posteriors. To address this issue, this paper proposes to augment the +supervised information with part-level labels, encouraging the model to focus +on and integrate richer information from various parts. Specifically, our +method first partitions features into distinct parts by cropping instances, +yielding part-level labels associated with these various parts. Subsequently, +we introduce a novel single-to-multiple transition matrix to model the +relationship between the noisy and part-level labels, which incorporates +part-level labels into a classifier-consistent framework. Utilizing this +framework with part-level labels, we can learn the noisy class posteriors more +precisely by guiding the model to integrate information from various parts, +ultimately improving the classification performance. Our method is +theoretically sound, while experiments show that it is empirically effective in +synthetic and real-world noisy benchmarks.",cs.CV,"['cs.CV', 'cs.LG']" +CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification,Yiyu Chen · Zheyi Fan · Zhaoru Chen · Yixuan Zhu, ,https://arxiv.org/abs/2311.10605,,2311.10605.pdf,CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification,"Person re-identification (re-ID) is a challenging task that aims to learn +discriminative features for person retrieval. In person re-ID, Jaccard distance +is a widely used distance metric, especially in re-ranking and clustering +scenarios. 
However, we discover that camera variation has a significant +negative impact on the reliability of Jaccard distance. In particular, Jaccard +distance calculates the distance based on the overlap of relevant neighbors. +Due to camera variation, intra-camera samples dominate the relevant neighbors, +which reduces the reliability of the neighbors by introducing intra-camera +negative samples and excluding inter-camera positive samples. To overcome this +problem, we propose a novel camera-aware Jaccard (CA-Jaccard) distance that +leverages camera information to enhance the reliability of Jaccard distance. +Specifically, we design camera-aware k-reciprocal nearest neighbors (CKRNNs) to +find k-reciprocal nearest neighbors on the intra-camera and inter-camera +ranking lists, which improves the reliability of relevant neighbors and +guarantees the contribution of inter-camera samples in the overlap. Moreover, +we propose a camera-aware local query expansion (CLQE) to mine reliable samples +in relevant neighbors by exploiting camera variation as a strong constraint and +assign these samples higher weights in overlap, further improving the +reliability. Our CA-Jaccard distance is simple yet effective and can serve as a +general distance metric for person re-ID methods with high reliability and low +computational cost. Extensive experiments demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation,Aysim Toker · Marvin Eisenberger · Daniel Cremers · Laura Leal-Taixe, ,https://arxiv.org/abs/2403.16605,,2403.16605.pdf,SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation,"In recent years, semantic segmentation has become a pivotal tool in +processing and interpreting satellite imagery. Yet, a prevalent limitation of +supervised learning techniques remains the need for extensive manual +annotations by experts. In this work, we explore the potential of generative +image diffusion to address the scarcity of annotated data in earth observation +tasks. The main idea is to learn the joint data manifold of images and labels, +leveraging recent advancements in denoising diffusion probabilistic models. To +the best of our knowledge, we are the first to generate both images and +corresponding masks for satellite segmentation. We find that the obtained pairs +not only display high quality in fine-scale features but also ensure a wide +sampling diversity. Both aspects are crucial for earth observation data, where +semantic classes can vary severely in scale and occurrence frequency. We employ +the novel data instances for downstream segmentation, as a form of data +augmentation. In our experiments, we provide comparisons to prior works based +on discriminative diffusion models or GANs. We demonstrate that integrating +generated samples yields significant quantitative improvements for satellite +semantic segmentation -- both compared to baselines and when training only on +the original data.",cs.CV,['cs.CV'] +EASE-DETR: Easing the Competition among Object Queries,Yulu Gao · Yifan Sun · Xudong Ding · Chuyang Zhao · Si Liu, ,https://arxiv.org/abs/2310.08854,,2310.08854.pdf,Rank-DETR for High Quality Object Detection,"Modern detection transformers (DETRs) use a set of object queries to predict +a list of bounding boxes, sort them by their classification confidence scores, +and select the top-ranked predictions as the final detection results for the +given input image. 
A highly performant object detector requires accurate +ranking for the bounding box predictions. For DETR-based detectors, the +top-ranked bounding boxes suffer from less accurate localization quality due to +the misalignment between classification scores and localization accuracy, thus +impeding the construction of high-quality detectors. In this work, we introduce +a simple and highly performant DETR-based object detector by proposing a series +of rank-oriented designs, combinedly called Rank-DETR. Our key contributions +include: (i) a rank-oriented architecture design that can prompt positive +predictions and suppress the negative ones to ensure lower false positive +rates, as well as (ii) a rank-oriented loss function and matching cost design +that prioritizes predictions of more accurate localization accuracy during +ranking to boost the AP under high IoU thresholds. We apply our method to +improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong +COCO object detection results when using different backbones such as +ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our +approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.",cs.CV,"['cs.CV', 'cs.LG']" +Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation,Razvan Pasca · Alexey Gavryushin · Muhammad Hamza · Yen-Ling Kuo · Kaichun Mo · Luc Van Gool · Otmar Hilliges · Xi Wang, ,,https://dblp.org/rec/journals/corr/abs-2301-09209,,,,,nan +LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,Yunsheng Ma · Can Cui · Xu Cao · Wenqian Ye · Peiran Liu · Juanwu Lu · Amr Abdelraouf · Rohit Gupta · Kyungtae Han · Aniket Bera · James Rehg · Ziran Wang, ,https://arxiv.org/abs/2312.04372v2,,2312.04372v2.pdf,LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,"Autonomous driving (AD) has made significant strides in recent years. +However, existing frameworks struggle to interpret and execute spontaneous user +instructions, such as ""overtake the car ahead."" Large Language Models (LLMs) +have demonstrated impressive reasoning capabilities showing potential to bridge +this gap. In this paper, we present LaMPilot, a novel framework that integrates +LLMs into AD systems, enabling them to follow user instructions by generating +code that leverages established functional primitives. We also introduce +LaMPilot-Bench, the first benchmark dataset specifically designed to +quantitatively evaluate the efficacy of language model programs in AD. Adopting +the LaMPilot framework, we conduct extensive experiments to assess the +performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate +the potential of LLMs in handling diverse driving scenarios and following user +instructions in driving. To facilitate further research in this area, we +release our code and data at https://github.com/PurdueDigitalTwin/LaMPilot.",cs.CL,"['cs.CL', 'cs.AI']" +C3: High-performance and low-complexity neural compression from a single image or video,Hyunjik Kim · Matthias Bauer · Lucas Theis · Jonathan Richard Schwarz · Emilien Dupont, ,https://arxiv.org/abs/2312.02753,,2312.02753.pdf,C3: High-performance and low-complexity neural compression from a single image or video,"Most neural compression models are trained on large datasets of images or +videos in order to generalize to unseen data. 
Such generalization typically +requires large and expressive architectures with a high decoding complexity. +Here we introduce C3, a neural compression method with strong rate-distortion +(RD) performance that instead overfits a small model to each image or video +separately. The resulting decoding complexity of C3 can be an order of +magnitude lower than neural baselines with similar RD performance. C3 builds on +COOL-CHIC (Ladune et al.) and makes several simple and effective improvements +for images. We further develop new methodology to apply C3 to videos. On the +CLIC2020 image benchmark, we match the RD performance of VTM, the reference +implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. +On the UVG video benchmark, we match the RD performance of the Video +Compression Transformer (Mentzer et al.), a well-established neural video +codec, with less than 5k MACs/pixel for decoding.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG', 'stat.ML']" +Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture,Juanwu Lu · Can Cui · Yunsheng Ma · Aniket Bera · Ziran Wang, ,https://arxiv.org/abs/2404.03789,,2404.03789.pdf,Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture,"Safety and robustness are crucial factors in developing trustworthy +autonomous vehicles. One essential aspect of addressing these factors is to +equip vehicles with the capability to predict future trajectories for all +moving objects in the surroundings and quantify prediction uncertainties. In +this paper, we propose the Sequential Neural Variational Agent (SeNeVA), a +generative model that describes the distribution of future trajectories for a +single moving object. Our approach can distinguish Out-of-Distribution data +while quantifying uncertainty and achieving competitive performance compared to +state-of-the-art methods on the Argoverse 2 and INTERACTION datasets. +Specifically, a 0.446 meters minimum Final Displacement Error, a 0.203 meters +minimum Average Displacement Error, and a 5.35% Miss Rate are achieved on the +INTERACTION test set. Extensive qualitative and quantitative analysis is also +provided to evaluate the proposed model. Our open-source code is available at +https://github.com/PurdueDigitalTwin/seneva.",cs.CV,"['cs.CV', 'cs.AI']" +Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers,Sanghyeok Lee · Joonmyung Choi · Hyunwoo J. Kim,https://github.com/mlvlab/MCTF,https://arxiv.org/abs/2403.10030,,2403.10030.pdf,Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers,"Vision Transformer (ViT) has emerged as a prominent backbone for computer +vision. For more efficient ViTs, recent works lessen the quadratic cost of the +self-attention layer by pruning or fusing the redundant tokens. However, these +works faced the speed-accuracy trade-off caused by the loss of information. +Here, we argue that token fusion needs to consider diverse relations between +tokens to minimize information loss. In this paper, we propose a Multi-criteria +Token Fusion (MCTF), that gradually fuses the tokens based on multi-criteria +(e.g., similarity, informativeness, and size of fused tokens). Further, we +utilize the one-step-ahead attention, which is the improved approach to capture +the informativeness of the tokens. By training the model equipped with MCTF +using a token reduction consistency, we achieve the best speed-accuracy +trade-off in the image classification (ImageNet1K). 
Experimental results prove +that MCTF consistently surpasses the previous reduction methods with and +without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by +about 44% while improving the performance (+0.5%, and +0.3%) over the base +model, respectively. We also demonstrate the applicability of MCTF in various +Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup +without performance degradation. Code is available at +https://github.com/mlvlab/MCTF.",cs.CV,['cs.CV'] +Fooling Polarization-based Vision using Locally Controllable Polarizing Projection,Zhuoxiao Li · Zhihang Zhong · Shohei Nobuhara · Ko Nishino · Yinqiang Zheng, ,,https://paperswithcode.com/search?q=author:Ko+Nishino,,,,,nan +MuGE: Multiple Granularity Edge Detection,Caixia Zhou · Yaping Huang · Mengyang Pu · Qingji Guan · Ruoxi Deng · Haibin Ling, ,,https://www.semanticscholar.org/paper/Practical-Edge-Detection-via-Robust-Collaborative-Fu-Guo/1b7f58d62ac5bcb292da96863482ade8348c9534,,,,,nan +Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos,Leonhard Sommer · Artur Jesslen · Eddy Ilg · Adam Kortylewski, ,https://arxiv.org/abs/2404.05626,,2404.05626.pdf,Learning a Category-level Object Pose Estimator without Pose Annotations,"3D object pose estimation is a challenging task. Previous works always +require thousands of object images with annotated poses for learning the 3D +pose correspondence, which is laborious and time-consuming for labeling. In +this paper, we propose to learn a category-level 3D object pose estimator +without pose annotations. Instead of using manually annotated images, we +leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under +controlled pose differences and propose to learn our object pose estimator with +those images. Directly using the original diffusion model leads to images with +noisy poses and artifacts. To tackle this issue, firstly, we exploit an image +encoder, which is learned from a specially designed contrastive pose learning, +to filter the unreasonable details and extract image feature maps. +Additionally, we propose a novel learning strategy that allows the model to +learn object poses from those generated image sets without knowing the +alignment of their canonical poses. Experimental results show that our method +has the capability of category-level object pose estimation from a single shot +setting (as pose definition), while significantly outperforming other +state-of-the-art methods on the few-shot category-level object pose estimation +benchmarks.",cs.CV,['cs.CV'] +Long-Tailed Anomaly Detection with Learnable Class Names,Chih-Hui Ho · Kuan-Chuan Peng · Nuno Vasconcelos,http://www.svcl.ucsd.edu/projects/ltad/,https://arxiv.org/abs/2403.20236,,,Long-Tailed Anomaly Detection with Learnable Class Names,"Anomaly detection (AD) aims to identify defective images and localize their +defects (if any). Ideally, AD models should be able to detect defects over many +image classes; without relying on hard-coded class names that can be +uninformative or inconsistent across datasets; learn without anomaly +supervision; and be robust to the long-tailed distributions of real-world +applications. To address these challenges, we formulate the problem of +long-tailed AD by introducing several datasets with different levels of class +imbalance and metrics for performance evaluation. 
We then propose a novel +method, LTAD, to detect defects from multiple and long-tailed classes, without +relying on dataset class names. LTAD combines AD by reconstruction and semantic +AD modules. AD by reconstruction is implemented with a transformer-based +reconstruction module. Semantic AD is implemented with a binary classifier, +which relies on learned pseudo class names and a pretrained foundation model. +These modules are learned over two phases. Phase 1 learns the pseudo-class +names and a variational autoencoder (VAE) for feature synthesis that augments +the training data to combat long-tails. Phase 2 then learns the parameters of +the reconstruction and classification modules of LTAD. Extensive experiments +using the proposed long-tailed datasets show that LTAD substantially +outperforms the state-of-the-art methods for most forms of dataset imbalance. +The long-tailed dataset split is available at +https://zenodo.org/records/10854201 .",cs.CV,['cs.CV'] +DiffusionRegPose: Enhancing Multi-Person Pose Estimation using a Diffusion-Based End-to-End Regression Approach,Dayi Tan · Hansheng Chen · Wei Tian · Lu Xiong, ,https://arxiv.org/abs/2401.04921,,2401.04921.pdf,Diffusion-based Pose Refinement and Muti-hypothesis Generation for 3D Human Pose Estimaiton,"Previous probabilistic models for 3D Human Pose Estimation (3DHPE) aimed to +enhance pose accuracy by generating multiple hypotheses. However, most of the +hypotheses generated deviate substantially from the true pose. Compared to +deterministic models, the excessive uncertainty in probabilistic models leads +to weaker performance in single-hypothesis prediction. To address these two +challenges, we propose a diffusion-based refinement framework called DRPose, +which refines the output of deterministic models by reverse diffusion and +achieves more suitable multi-hypothesis prediction for the current pose +benchmark by multi-step refinement with multiple noises. To this end, we +propose a Scalable Graph Convolution Transformer (SGCT) and a Pose Refinement +Module (PRM) for denoising and refining. Extensive experiments on Human3.6M and +MPI-INF-3DHP datasets demonstrate that our method achieves state-of-the-art +performance on both single and multi-hypothesis 3DHPE. Code is available at +https://github.com/KHB1698/DRPose.",cs.CV,['cs.CV'] +Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning,Woo-Jin Ahn · Geun-Yeong Yang · Hyunduck Choi · Myo-Taeg Lim,https://github.com/root0yang/BlindNet,https://arxiv.org/abs/2403.06122,,2403.06122.pdf,Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning,"Deep learning models for semantic segmentation often experience performance +degradation when deployed to unseen target domains unidentified during the +training phase. This is mainly due to variations in image texture (\ie style) +from different data sources. To tackle this challenge, existing domain +generalized semantic segmentation (DGSS) methods attempt to remove style +variations from the feature. However, these approaches struggle with the +entanglement of style and content, which may lead to the unintentional removal +of crucial content information, causing performance degradation. This study +addresses this limitation by proposing BlindNet, a novel DGSS approach that +blinds the style without external modules or datasets. 
The main idea behind our +proposed approach is to alleviate the effect of style in the encoder whilst +facilitating robust segmentation in the decoder. To achieve this, BlindNet +comprises two key components: covariance alignment and semantic consistency +contrastive learning. Specifically, the covariance alignment trains the encoder +to uniformly recognize various styles and preserve the content information of +the feature, rather than removing the style-sensitive factor. Meanwhile, +semantic consistency contrastive learning enables the decoder to construct +discriminative class embedding space and disentangles features that are +vulnerable to misclassification. Through extensive experiments, our approach +outperforms existing DGSS methods, exhibiting robustness and superior +performance for semantic segmentation on unseen target domains.",cs.CV,['cs.CV'] +Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack,Sabbir Ahmed · RANYANG ZHOU · Shaahin Angizi · Adnan Rakin Rakin, ,,,,,,,nan +Robust Image Denoising through Adversarial Frequency Mixup,Donghun Ryou · Inju Ha · Hyewon Yoo · Dongwan Kim · Bohyung Han, ,https://arxiv.org/abs/2306.16050,,2306.16050.pdf,Evaluating Similitude and Robustness of Deep Image Denoising Models via Adversarial Attack,"Deep neural networks (DNNs) have shown superior performance comparing to +traditional image denoising algorithms. However, DNNs are inevitably vulnerable +while facing adversarial attacks. In this paper, we propose an adversarial +attack method named denoising-PGD which can successfully attack all the current +deep denoising models while keep the noise distribution almost unchanged. We +surprisingly find that the current mainstream non-blind denoising models +(DnCNN, FFDNet, ECNDNet, BRDNet), blind denoising models (DnCNN-B, Noise2Noise, +RDDCNN-B, FAN), plug-and-play (DPIR, CurvPnP) and unfolding denoising models +(DeamNet) almost share the same adversarial sample set on both grayscale and +color images, respectively. Shared adversarial sample set indicates that all +these models are similar in term of local behaviors at the neighborhood of all +the test samples. Thus, we further propose an indicator to measure the local +similarity of models, called robustness similitude. Non-blind denoising models +are found to have high robustness similitude across each other, while +hybrid-driven models are also found to have high robustness similitude with +pure data-driven non-blind denoising models. According to our robustness +assessment, data-driven non-blind denoising models are the most robust. We use +adversarial training to complement the vulnerability to adversarial attacks. +Moreover, the model-driven image denoising BM3D shows resistance on adversarial +attacks.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction,Jianping Jiang · xinyu zhou · Bingxuan Wang · Xiaoming Deng · Chao Xu · Boxin Shi, ,https://arxiv.org/abs/2403.07346v1,,2403.07346v1.pdf,Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction,"Reliable hand mesh reconstruction (HMR) from commonly-used color and depth +sensors is challenging especially under scenarios with varied illuminations and +fast motions. Event camera is a highly promising alternative for its high +dynamic range and dense temporal resolution properties, but it lacks key +texture appearance for hand mesh reconstruction. 
In this paper, we propose +EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event +camera and an RGB camera compensating for each other. By fusing two modalities +of data across time, space, and information dimensions,EvRGBHand can tackle +overexposure and motion blur issues in RGB-based HMR and foreground scarcity +and background overflow issues in event-based HMR. We further propose +EvRGBDegrader, which allows our model to generalize effectively in challenging +scenes, even when trained solely on standard scenes, thus reducing data +acquisition costs. Experiments on real-world data demonstrate that EvRGBHand +can effectively solve the challenging issues when using either type of camera +alone via retaining the merits of both, and shows the potential of +generalization to outdoor scenes and another type of event camera.",cs.CV,['cs.CV'] +Friendly Sharpness-Aware Minimization,Tao Li · Pan Zhou · Zhengbao He · Xinwen Cheng · Xiaolin Huang, ,https://arxiv.org/abs/2403.12350,,2403.12350.pdf,Friendly Sharpness-Aware Minimization,"Sharpness-Aware Minimization (SAM) has been instrumental in improving deep +neural network training by minimizing both training loss and loss sharpness. +Despite the practical success, the mechanisms behind SAM's generalization +enhancements remain elusive, limiting its progress in deep learning +optimization. In this work, we investigate SAM's core components for +generalization improvement and introduce ""Friendly-SAM"" (F-SAM) to further +enhance SAM's generalization. Our investigation reveals the key role of +batch-specific stochastic gradient noise within the adversarial perturbation, +i.e., the current minibatch gradient, which significantly influences SAM's +generalization performance. By decomposing the adversarial perturbation in SAM +into full gradient and stochastic gradient noise components, we discover that +relying solely on the full gradient component degrades generalization while +excluding it leads to improved performance. The possible reason lies in the +full gradient component's increase in sharpness loss for the entire dataset, +creating inconsistencies with the subsequent sharpness minimization step solely +on the current minibatch data. Inspired by these insights, F-SAM aims to +mitigate the negative effects of the full gradient component. It removes the +full gradient estimated by an exponentially moving average (EMA) of historical +stochastic gradients, and then leverages stochastic gradient noise for improved +generalization. Moreover, we provide theoretical validation for the EMA +approximation and prove the convergence of F-SAM on non-convex problems. +Extensive experiments demonstrate the superior generalization performance and +robustness of F-SAM over vanilla SAM. Code is available at +https://github.com/nblt/F-SAM.",cs.LG,['cs.LG'] +Efficient Hyperparameter Optimization with Adaptive Fidelity Identification,Jiantong Jiang · Zeyi Wen · Atif Mansoor · Ajmal Mian, ,https://arxiv.org/html/2405.15605v2,,2405.15605v2.pdf,Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference,"Probabilistic graphical models (PGMs) serve as a powerful framework for +modeling complex systems with uncertainty and extracting valuable insights from +data. However, users face challenges when applying PGMs to their problems in +terms of efficiency and usability. This paper presents Fast-PGM, an efficient +and open-source library for PGM learning and inference. 
Fast-PGM supports +comprehensive tasks on PGMs, including structure and parameter learning, as +well as exact and approximate inference, and enhances efficiency of the tasks +through computational and memory optimizations and parallelization techniques. +Concurrently, Fast-PGM furnishes developers with flexible building blocks, +furnishes learners with detailed documentation, and affords non-experts +user-friendly interfaces, thereby ameliorating the usability of PGMs to users +across a spectrum of expertise levels. The source code of Fast-PGM is available +at https://github.com/jjiantong/FastPGM.",cs.LG,['cs.LG'] +Exploring Pose-Aware Human-Object Interaction via Hybrid Learning,EASTMAN Z Y WU · Yali Li · Yuan Wang · Shengjin Wang, ,https://arxiv.org/abs/2403.07246,,2403.07246.pdf,Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration,"Human-object interaction (HOI) detection aims to locate human-object pairs +and identify their interaction categories in images. Most existing methods +primarily focus on supervised learning, which relies on extensive manual HOI +annotations. In this paper, we propose a novel framework, termed Knowledge +Integration to HOI (KI2HOI), that effectively integrates the knowledge of +visual-language model to improve zero-shot HOI detection. Specifically, the +verb feature learning module is designed based on visual semantics, by +employing the verb extraction decoder to convert corresponding verb queries +into interaction-specific category representations. We develop an effective +additive self-attention mechanism to generate more comprehensive visual +representations. Moreover, the innovative interaction representation decoder +effectively extracts informative regions by integrating spatial and visual +feature information through a cross-attention mechanism. To deal with zero-shot +learning in low-data, we leverage a priori knowledge from the CLIP text encoder +to initialize the linear classifier for enhanced interaction understanding. +Extensive experiments conducted on the mainstream HICO-DET and V-COCO datasets +demonstrate that our model outperforms the previous methods in various +zero-shot and full-supervised settings.",cs.CV,['cs.CV'] +HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation,Linglin Jing · Yiming Ding · Yunpeng Gao · Zhigang Wang · Xu Yan · Dong Wang · Gerald Schaefer · Hui Fang · Bin Zhao · Xuelong Li, ,https://arxiv.org/abs/2403.16788,,2403.16788.pdf,HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation,"Event-based semantic segmentation has gained popularity due to its capability +to deal with scenarios under high-speed motion and extreme lighting conditions, +which cannot be addressed by conventional RGB cameras. Since it is hard to +annotate event data, previous approaches rely on event-to-image reconstruction +to obtain pseudo labels for training. However, this will inevitably introduce +noise, and learning from noisy pseudo labels, especially when generated from a +single source, may reinforce the errors. This drawback is also called +confirmation bias in pseudo-labeling. In this paper, we propose a novel hybrid +pseudo-labeling framework for unsupervised event-based semantic segmentation, +HPL-ESS, to alleviate the influence of noisy pseudo labels. In particular, we +first employ a plain unsupervised domain adaptation framework as our baseline, +which can generate a set of pseudo labels through self-training. 
Then, we +incorporate offline event-to-image reconstruction into the framework, and +obtain another set of pseudo labels by predicting segmentation maps on the +reconstructed images. A noisy label learning strategy is designed to mix the +two sets of pseudo labels and enhance the quality. Moreover, we propose a soft +prototypical alignment module to further improve the consistency of target +domain features. Extensive experiments show that our proposed method +outperforms existing state-of-the-art methods by a large margin on the +DSEC-Semantic dataset (+5.88% accuracy, +10.32% mIoU), which even surpasses +several supervised methods.",cs.CV,['cs.CV'] +Text-to-3D Generation with Bidirectional Diffusion using both 3D and 2D priors,Lihe Ding · Shaocong Dong · Zhanpeng Huang · Zibin Wang · Yiyuan Zhang · Kaixiong Gong · Dan Xu · Tianfan Xue, ,https://arxiv.org/abs/2312.04963,,2312.04963.pdf,Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors,"Most 3D generation research focuses on up-projecting 2D foundation models +into the 3D space, either by minimizing 2D Score Distillation Sampling (SDS) +loss or fine-tuning on multi-view datasets. Without explicit 3D priors, these +methods often lead to geometric anomalies and multi-view inconsistency. +Recently, researchers have attempted to improve the genuineness of 3D objects +by directly training on 3D datasets, albeit at the cost of low-quality texture +generation due to the limited texture diversity in 3D datasets. To harness the +advantages of both approaches, we propose Bidirectional Diffusion(BiDiff), a +unified framework that incorporates both a 3D and a 2D diffusion process, to +preserve both 3D fidelity and 2D texture richness, respectively. Moreover, as a +simple combination may yield inconsistent generation results, we further bridge +them with novel bidirectional guidance. In addition, our method can be used as +an initialization of optimization-based models to further improve the quality +of 3D model and efficiency of optimization, reducing the generation process +from 3.4 hours to 20 minutes. Experimental results have shown that our model +achieves high-quality, diverse, and scalable 3D generation. 
Project website: +https://bidiff.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives,Kristen Grauman · Andrew Westbury · Lorenzo Torresani · Kris Kitani · Jitendra Malik · Triantafyllos Afouras · Kumar Ashutosh · Vijay Baiyya · Siddhant Bansal · Bikram Boote · Eugene Byrne · Zachary Chavis · Joya Chen · Feng Cheng · Fu-Jen Chu · Sean Crane · Avijit Dasgupta · Jing Dong · Maria Escobar · Cristhian David Forigua Diaz · Abrham Gebreselasie · Sanjay Haresh · Jing Huang · Md Mohaiminul Islam · Suyog Jain · Rawal Khirodkar · Devansh Kukreja · Kevin Liang · Jia-Wei Liu · Sagnik Majumder · Yongsen Mao · Miguel Martin · Effrosyni Mavroudi · Tushar Nagarajan · Francesco Ragusa · Santhosh Kumar Ramakrishnan · Luigi Seminara · Arjun Somayazulu · Yale Song · Shan Su · Zihui Xue · Edward Zhang · Jinxu Zhang · Angela Castillo · Changan Chen · Fu Xinzhu · Ryosuke Furuta · Cristina González · Gupta · Jiabo Hu · Yifei Huang · Yiming Huang · Weslie Khoo · Anush Kumar · Robert Kuo · Sach Lakhavani · Miao Liu · Mi Luo · Zhengyi Luo · Brighid Meredith · Austin Miller · Oluwatumininu Oguntola · Xiaqing Pan · Penny Peng · Shraman Pramanick · Merey Ramazanova · Fiona Ryan · Wei Shan · Kiran Somasundaram · Chenan Song · Audrey Southerland · Masatoshi Tateno · Huiyu Wang · Yuchen Wang · Takuma Yagi · Mingfei Yan · Xitong Yang · Zecheng Yu · Shengxin Zha · Chen Zhao · Ziwei Zhao · Zhifan Zhu · Jeff Zhuo · Pablo ARBELAEZ · Gedas Bertasius · Dima Damen · Jakob Engel · Giovanni Maria Farinella · Antonino Furnari · Bernard Ghanem · Judy Hoffman · C.V. Jawahar · Richard Newcombe · Hyun Soo Park · James Rehg · Yoichi Sato · Manolis Savva · Jianbo Shi · Mike Zheng Shou · Michael Wray,https://ego-exo4d-data.org,https://arxiv.org/abs/2311.18259,,2311.18259.pdf,Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives,"We present Ego-Exo4D, a diverse, large-scale multimodal multiview video +dataset and benchmark challenge. Ego-Exo4D centers around +simultaneously-captured egocentric and exocentric video of skilled human +activities (e.g., sports, music, dance, bike repair). 740 participants from 13 +cities worldwide performed these activities in 123 different natural scene +contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours +of video combined. The multimodal nature of the dataset is unprecedented: the +video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera +poses, IMU, and multiple paired language descriptions -- including a novel +""expert commentary"" done by coaches and teachers and tailored to the +skilled-activity domain. To push the frontier of first-person video +understanding of skilled human activity, we also present a suite of benchmark +tasks and their annotations, including fine-grained activity understanding, +proficiency estimation, cross-view translation, and 3D hand/body pose. All +resources are open sourced to fuel new research in the community. 
Project page: +http://ego-exo4d-data.org/",cs.CV,"['cs.CV', 'cs.AI']" +Control4D: Efficient 4D Portrait Editing with Text,Ruizhi Shao · Jingxiang Sun · Cheng Peng · Zerong Zheng · Boyao ZHOU · Hongwen Zhang · Yebin Liu,https://control4darxiv.github.io,https://arxiv.org/abs/2405.17405,,2405.17405.pdf,Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer,"We present a novel approach for generating high-quality, spatio-temporally +coherent human videos from a single image under arbitrary viewpoints. Our +framework combines the strengths of U-Nets for accurate condition injection and +diffusion transformers for capturing global correlations across viewpoints and +time. The core is a cascaded 4D transformer architecture that factorizes +attention across views, time, and spatial dimensions, enabling efficient +modeling of the 4D space. Precise conditioning is achieved by injecting human +identity, camera parameters, and temporal signals into the respective +transformers. To train this model, we curate a multi-dimensional dataset +spanning images, videos, multi-view data and 3D/4D scans, along with a +multi-dimensional training strategy. Our approach overcomes the limitations of +previous methods based on GAN or UNet-based diffusion models, which struggle +with complex motions and viewpoint changes. Through extensive experiments, we +demonstrate our method's ability to synthesize realistic, coherent and +free-view human videos, paving the way for advanced multimedia applications in +areas such as virtual reality and animation. Our project website is +https://human4dit.github.io.",cs.CV,['cs.CV'] +Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model,Runmin Dong · Shuai Yuan · Bin Luo · Mengxuan Chen · Jinxiao Zhang · Lixian Zhang · Weijia Li · Juepeng Zheng · Haohuan Fu, ,https://arxiv.org/abs/2403.17460,,2403.17460.pdf,Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model,"Reference-based super-resolution (RefSR) has the potential to build bridges +across spatial and temporal resolutions of remote sensing images. However, +existing RefSR methods are limited by the faithfulness of content +reconstruction and the effectiveness of texture transfer in large scaling +factors. Conditional diffusion models have opened up new opportunities for +generating realistic high-resolution images, but effectively utilizing +reference images within these models remains an area for further exploration. +Furthermore, content fidelity is difficult to guarantee in areas without +relevant reference information. To solve these issues, we propose a +change-aware diffusion model named Ref-Diff for RefSR, using the land cover +change priors to guide the denoising process explicitly. Specifically, we +inject the priors into the denoising model to improve the utilization of +reference information in unchanged areas and regulate the reconstruction of +semantically relevant content in changed areas. With this powerful guidance, we +decouple the semantics-guided denoising and reference texture-guided denoising +processes to improve the model performance. Extensive experiments demonstrate +the superior effectiveness and robustness of the proposed method compared with +state-of-the-art RefSR methods in both quantitative and qualitative +evaluations. 
The code and data are available at +https://github.com/dongrunmin/RefDiff.",eess.IV,"['eess.IV', 'cs.CV']" +MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning,Mohamed Abdelfattah · Mariam Hassan · Alex Alahi, ,https://arxiv.org/abs/2312.04819,,2312.04819.pdf,Attention-Guided Contrastive Role Representations for Multi-Agent Reinforcement Learning,"Real-world multi-agent tasks usually involve dynamic team composition with +the emergence of roles, which should also be a key to efficient cooperation in +multi-agent reinforcement learning (MARL). Drawing inspiration from the +correlation between roles and agent's behavior patterns, we propose a novel +framework of **A**ttention-guided **CO**ntrastive **R**ole representation +learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge +transfer, and skillful coordination across agents. First, we introduce mutual +information maximization to formalize role representation learning, derive a +contrastive learning objective, and concisely approximate the distribution of +negative pairs. Second, we leverage an attention mechanism to prompt the global +state to attend to learned role representations in value decomposition, +implicitly guiding agent coordination in a skillful role space to yield more +expressive credit assignment. Experiments on challenging StarCraft II +micromanagement and Google research football tasks demonstrate the +state-of-the-art performance of our method and its advantages over existing +approaches. Our code is available at +[https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM).",cs.MA,['cs.MA'] +"Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs",Sunghwan Hong · Jaewoo Jung · Heeseong Shin · Jiaolong Yang · Chong Luo · Seungryong Kim,https://ku-cvlab.github.io/CoPoNeRF/,https://arxiv.org/abs/2312.07246,,2312.07246.pdf,"Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs","This work delves into the task of pose-free novel view synthesis from stereo +pairs, a challenging and pioneering task in 3D vision. Our innovative +framework, unlike any before, seamlessly integrates 2D correspondence matching, +camera pose estimation, and NeRF rendering, fostering a synergistic enhancement +of these tasks. We achieve this through designing an architecture that utilizes +a shared representation, which serves as a foundation for enhanced 3D geometry +understanding. Capitalizing on the inherent interplay between the tasks, our +unified framework is trained end-to-end with the proposed training strategy to +improve overall model accuracy. Through extensive evaluations across diverse +indoor and outdoor scenes from two real-world datasets, we demonstrate that our +approach achieves substantial improvement over previous methodologies, +especially in scenarios characterized by extreme viewpoint changes and the +absence of accurate camera poses.",cs.CV,['cs.CV'] +Neural Markov Random Field for Stereo Matching,Tongfan Guan · Chen Wang · Yun-Hui Liu,https://github.com/aeolusguan/NMRF,https://arxiv.org/abs/2403.11193,,2403.11193.pdf,Neural Markov Random Field for Stereo Matching,"Stereo matching is a core task for many computer vision and robotics +applications. Despite their dominance in traditional stereo methods, the +hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy +compared to end-to-end deep models. 
While deep learning representations have +greatly improved the unary terms of the MRF models, the overall accuracy is +still severely limited by the hand-crafted pairwise terms and message passing. +To address these issues, we propose a neural MRF model, where both potential +functions and message passing are designed using data-driven neural networks. +Our fully data-driven model is built on the foundation of variational inference +theory, to prevent convergence issues and retain stereo MRF's graph inductive +bias. To make the inference tractable and scale well to high-resolution images, +we also propose a Disparity Proposal Network (DPN) to adaptively prune the +search space of disparity. The proposed approach ranks $1^{st}$ on both KITTI +2012 and 2015 leaderboards among all published methods while running faster +than 100 ms. This approach significantly outperforms prior global methods, +e.g., lowering D1 metric by more than 50% on KITTI 2015. In addition, our +method exhibits strong cross-domain generalization and can recover sharp edges. +The codes at https://github.com/aeolusguan/NMRF",cs.CV,['cs.CV'] +Self-supervised debiasing using low rank regularization,Geon Yeong Park · Chanyong Jung · Sangmin Lee · Jong Chul Ye · Sang Wan Lee, ,,https://bispl.weebly.com/bispl-news/four-papers-got-accepted-for-cvpr-2024,,,,,nan +Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates,Ka Chun SHUM · Jaeyeon Kim · Binh-Son Hua · Thanh Nguyen · Sai-Kit Yeung,https://github.com/kcshum/pose-conditioned-NeRF-object-fusion,https://arxiv.org/abs/2309.11281,,2309.11281.pdf,Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates,"Neural radiance field is an emerging rendering method that generates +high-quality multi-view consistent images from a neural scene representation +and volume rendering. Although neural radiance field-based techniques are +robust for scene reconstruction, their ability to add or remove objects remains +limited. This paper proposes a new language-driven approach for object +manipulation with neural radiance fields through dataset updates. Specifically, +to insert a new foreground object represented by a set of multi-view images +into a background radiance field, we use a text-to-image diffusion model to +learn and generate combined images that fuse the object of interest into the +given background across views. These combined images are then used for refining +the background radiance field so that we can render view-consistent images +containing both the object and the background. To ensure view consistency, we +propose a dataset updates strategy that prioritizes radiance field training +with camera views close to the already-trained views prior to propagating the +training to remaining views. We show that under the same dataset updates +strategy, we can easily adapt our method for object insertion using data from +text-to-3D models as well as object removal. 
Experimental results show that our +method generates photorealistic images of the edited scenes, and outperforms +state-of-the-art methods in 3D reconstruction and neural radiance field +blending.",cs.CV,['cs.CV'] +Frozen Feature Augmentation for Few-Shot Image Classification,Andreas Bär · Neil Houlsby · Mostafa Dehghani · Manoj Kumar,https://frozen-feature-augmentation.github.io/,https://arxiv.org/abs/2403.10519,,2403.10519.pdf,Frozen Feature Augmentation for Few-Shot Image Classification,"Training a linear classifier or lightweight model on top of pretrained vision +model outputs, so-called 'frozen features', leads to impressive performance on +a number of downstream few-shot tasks. Currently, frozen features are not +modified during training. On the other hand, when networks are trained directly +on images, data augmentation is a standard recipe that improves performance +with no substantial overhead. In this paper, we conduct an extensive pilot +study on few-shot image classification that explores applying data +augmentations in the frozen feature space, dubbed 'frozen feature augmentation +(FroFA)', covering twenty augmentations in total. Our study demonstrates that +adopting a deceptively simple pointwise FroFA, such as brightness, can improve +few-shot performance consistently across three network architectures, three +large pretraining datasets, and eight transfer datasets.",cs.CV,['cs.CV'] +VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning,Ziyang Luo · Nian Liu · Wangbo Zhao · Xuguang Yang · Dingwen Zhang · Deng-Ping Fan · Fahad Shahbaz Khan · Junwei Han, ,https://arxiv.org/abs/2311.15011,,2311.15011.pdf,VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning,"Salient object detection (SOD) and camouflaged object detection (COD) are +related yet distinct binary mapping tasks. These tasks involve multiple +modalities, sharing commonalities and unique cues. Existing research often +employs intricate task-specific specialist models, potentially leading to +redundancy and suboptimal results. We introduce VSCode, a generalist model with +novel 2D prompt learning, to jointly address four SOD tasks and three COD +tasks. We utilize VST as the foundation model and introduce 2D prompts within +the encoder-decoder architecture to learn domain and task-specific knowledge on +two separate dimensions. A prompt discrimination loss helps disentangle +peculiarities to benefit model optimization. VSCode outperforms +state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot +generalization to unseen tasks by combining 2D prompts, such as RGB-D COD. +Source code has been available at https://github.com/Sssssuperior/VSCode.",cs.CV,['cs.CV'] +GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds,Prashant Kumar · Kshitij Madhav Bhat · Vedang Bhupesh Shenvi Nadkarni · Prem Kalra, ,https://arxiv.org/abs/2312.00068,,2312.00068.pdf,GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds,"Sparse LiDAR point clouds cause severe loss of detail of static structures +and reduce the density of static points available for navigation. Reduced +density can be detrimental to navigation under several scenarios. We observe +that despite high sparsity, in most cases, the global topology of LiDAR +outlining the static structures can be inferred. 
We utilize this property to +obtain a backbone skeleton of a LiDAR scan in the form of a single connected +component that is a proxy to its global topology. We utilize the backbone to +augment new points along static structures to overcome sparsity. Newly +introduced points could correspond to existing static structures or to static +points that were earlier obstructed by dynamic objects. To the best of our +knowledge, we are the first to use such a strategy for sparse LiDAR point +clouds. Existing solutions close to our approach fail to identify and preserve +the global static LiDAR topology and generate sub-optimal points. We propose +GLiDR, a Graph Generative network that is topologically regularized using +0-dimensional Persistent Homology ($\mathcal{PH}$) constraints. This enables +GLiDR to introduce newer static points along a topologically consistent global +static LiDAR backbone. GLiDR generates precise static points using $32\times$ +sparser dynamic scans and performs better than the baselines across three +datasets. GLiDR generates a valuable byproduct - an accurate binary +segmentation mask of static and dynamic objects that are helpful for navigation +planning and safety in constrained environments. The newly introduced static +points allow GLiDR to outperform LiDAR-based navigation using SLAM in several +settings. Source code is available at https://kshitijbhat.github.io/glidr",cs.RO,"['cs.RO', 'cs.CV']" +The STVchrono Dataset: Towards Continuous Change Recognition in Time,Yanjun Sun · Yue Qiu · Mariia Khan · Fumiya Matsuzawa · Kenji Iwata, ,,https://www.youtube.com/watch?v=44o-Xl60ipI,,,,,nan +NECA: Neural Customizable Human Avatar,Junjin Xiao · Qing Zhang · Zhan Xu · Wei-Shi Zheng,https://github.com/iSEE-Laboratory/NECA,https://arxiv.org/abs/2403.10335,,2403.10335.pdf,NECA: Neural Customizable Human Avatar,"Human avatar has become a novel type of 3D asset with various applications. +Ideally, a human avatar should be fully customizable to accommodate different +settings and environments. In this work, we introduce NECA, an approach capable +of learning versatile human representation from monocular or sparse-view +videos, enabling granular customization across aspects such as pose, shadow, +shape, lighting and texture. The core of our approach is to represent humans in +complementary dual spaces and predict disentangled neural fields of geometry, +albedo, shadow, as well as an external lighting, from which we are able to +derive realistic rendering with high-frequency details via volumetric +rendering. Extensive experiments demonstrate the advantage of our method over +the state-of-the-art methods in photorealistic rendering, as well as various +editing tasks such as novel pose synthesis and relighting. The code is +available at https://github.com/iSEE-Laboratory/NECA.",cs.CV,['cs.CV'] +Continual Segmentation with Disentangled Objectness Learning and Class Recognition,Yizheng Gong · Siyue Yu · Xiaoyang Wang · Jimin Xiao, ,https://arxiv.org/abs/2403.03477,,2403.03477.pdf,Continual Segmentation with Disentangled Objectness Learning and Class Recognition,"Most continual segmentation methods tackle the problem as a per-pixel +classification task. However, such a paradigm is very challenging, and we find +query-based segmenters with built-in objectness have inherent advantages +compared with per-pixel ones, as objectness has strong transfer ability and +forgetting resistance. 
Based on these findings, we propose CoMasTRe by +disentangling continual segmentation into two stages: forgetting-resistant +continual objectness learning and well-researched continual classification. +CoMasTRe uses a two-stage segmenter learning class-agnostic mask proposals at +the first stage and leaving recognition to the second stage. During continual +learning, a simple but effective distillation is adopted to strengthen +objectness. To further mitigate the forgetting of old classes, we design a +multi-label class distillation strategy suited for segmentation. We assess the +effectiveness of CoMasTRe on PASCAL VOC and ADE20K. Extensive experiments show +that our method outperforms per-pixel and query-based methods on both datasets. +Code will be available at https://github.com/jordangong/CoMasTRe.",cs.CV,['cs.CV'] +Text2Loc: 3D Point Cloud Localization from Natural Language,Yan Xia · Letian Shi · Zifeng Ding · João F. Henriques · Daniel Cremers, ,https://arxiv.org/abs/2311.15977,,2311.15977.pdf,Text2Loc: 3D Point Cloud Localization from Natural Language,"We tackle the problem of 3D point cloud localization based on a few natural +linguistic descriptions and introduce a novel neural network, Text2Loc, that +fully interprets the semantic relationship between points and text. Text2Loc +follows a coarse-to-fine localization pipeline: text-submap global place +recognition, followed by fine localization. In global place recognition, +relational dynamics among each textual hint are captured in a hierarchical +transformer with max-pooling (HTM), whereas a balance between positive and +negative pairs is maintained using text-submap contrastive learning. Moreover, +we propose a novel matching-free fine localization method to further refine the +location predictions, which completely removes the need for complicated +text-instance matching and is lighter, faster, and more accurate than previous +methods. Extensive experiments show that Text2Loc improves the localization +accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose +dataset. Our project page is publicly available at +\url{https://yan-xia.github.io/projects/text2loc/}.",cs.CV,['cs.CV'] +OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,Ganlong Zhao · Guanbin Li · Weikai Chen · Yizhou Yu, ,https://arxiv.org/abs/2403.17334,,2403.17334.pdf,OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,"Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce +a more meaningful and practical paradigm of VLN by maintaining the agent's +memory across tours of scenes. Although the long-term memory aligns better with +the persistent nature of the VLN task, it poses more challenges on how to +utilize the highly unstructured navigation memory with extremely sparse +supervision. Towards this end, we propose OVER-NAV, which aims to go over and +beyond the current arts of IVLN techniques. In particular, we propose to +incorporate LLMs and open-vocabulary detectors to distill key information and +establish correspondence between multi-modal signals. Such a mechanism +introduces reliable cross-modal supervision and enables on-the-fly +generalization to unseen scenes without the need of extra annotation and +re-training. 
To fully exploit the interpreted navigation data, we further +introduce a structured representation, coded Omnigraph, to effectively +integrate multi-modal information along the tour. Accompanied with a novel +omnigraph fusion mechanism, OVER-NAV is able to extract the most relevant +knowledge from omnigraph for a more accurate navigating action. In addition, +OVER-NAV seamlessly supports both discrete and continuous environments under a +unified framework. We demonstrate the superiority of OVER-NAV in extensive +experiments.",cs.CV,['cs.CV'] +Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework,Ziyao Huang · Fan Tang · Yong Zhang · Xiaodong Cun · Juan Cao · Jintao Li · Tong-yee Lee,https://github.com/ICTMCG/Make-Your-Anchor,https://arxiv.org/abs/2403.16510,,2403.16510.pdf,Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework,"Despite the remarkable process of talking-head-based avatar-creating +solutions, directly generating anchor-style videos with full-body motions +remains challenging. In this study, we propose Make-Your-Anchor, a novel system +necessitating only a one-minute video clip of an individual for training, +subsequently enabling the automatic generation of anchor-style videos with +precise torso and hand movements. Specifically, we finetune a proposed +structure-guided diffusion model on input video to render 3D mesh conditions +into human appearances. We adopt a two-stage training strategy for the +diffusion model, effectively binding movements with specific appearances. To +produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise +diffusion model to a 3D style without additional training cost, and a simple +yet effective batch-overlapped temporal denoising module is proposed to bypass +the constraints on video length during inference. Finally, a novel +identity-specific face enhancement module is introduced to improve the visual +quality of facial regions in the output videos. Comparative experiments +demonstrate the effectiveness and superiority of the system in terms of visual +quality, temporal coherence, and identity preservation, outperforming SOTA +diffusion/non-diffusion methods. Project page: +\url{https://github.com/ICTMCG/Make-Your-Anchor}.",cs.CV,['cs.CV'] +SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking,Xiaojun Hou · Jiazheng Xing · Yijie Qian · Yaowei Guo · Shuo Xin · Junhao Chen · Kai Tang · Mengmeng Wang · Zhengkai Jiang · Liang Liu · Yong Liu,https://github.com/hoqolo/SDSTrack,https://arxiv.org/abs/2403.16002,,2403.16002.pdf,SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking,"Multimodal Visual Object Tracking (VOT) has recently gained significant +attention due to its robustness. Early research focused on fully fine-tuning +RGB-based trackers, which was inefficient and lacked generalized representation +due to the scarcity of multimodal data. Therefore, recent studies have utilized +prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. +However, the modality gap limits pre-trained knowledge recall, and the +dominance of the RGB modality persists, preventing the full utilization of +information from other modalities. To address these issues, we propose a novel +symmetric multimodal tracking framework called SDSTrack. 
We introduce +lightweight adaptation for efficient fine-tuning, which directly transfers the +feature extraction ability from RGB to other domains with a small number of +trainable parameters and integrates multimodal features in a balanced, +symmetric manner. Furthermore, we design a complementary masked patch +distillation strategy to enhance the robustness of trackers in complex +environments, such as extreme weather, poor imaging, and sensor failure. +Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art +methods in various multimodal tracking scenarios, including RGB+Depth, +RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme +conditions. Our source code is available at https://github.com/hoqolo/SDSTrack.",cs.CV,['cs.CV'] +"Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras",Huajian Huang · Longwei Li · Hui Cheng · Sai-Kit Yeung, ,https://arxiv.org/abs/2311.16728,,2311.16728.pdf,"Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras","The integration of neural rendering and the SLAM system recently showed +promising results in joint localization and photorealistic view reconstruction. +However, existing methods, fully relying on implicit representations, are so +resource-hungry that they cannot run on portable devices, which deviates from +the original intention of SLAM. In this paper, we present Photo-SLAM, a novel +SLAM framework with a hyper primitives map. Specifically, we simultaneously +exploit explicit geometric features for localization and learn implicit +photometric features to represent the texture information of the observed +environment. In addition to actively densifying hyper primitives based on +geometric features, we further introduce a Gaussian-Pyramid-based training +method to progressively learn multi-level features, enhancing photorealistic +mapping performance. The extensive experiments with monocular, stereo, and +RGB-D datasets prove that our proposed system Photo-SLAM significantly +outperforms current state-of-the-art SLAM systems for online photorealistic +mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times +faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time +speed using an embedded platform such as Jetson AGX Orin, showing the potential +of robotics applications.",cs.CV,['cs.CV'] +Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation,Wenhao Li · Mengyuan Liu · Hong Liu · Pichao Wang · Jialun Cai · Nicu Sebe,https://github.com/NationalGAILab/HoT,,https://paperswithcode.com/paper/hourglass-tokenizer-for-efficient-transformer,,,,,nan +Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection,Jiaming Li · Jiacheng Zhang · Jichang Li · Ge Li · Si Liu · Liang Lin · Guanbin Li, ,https://arxiv.org/abs/2404.09216,,2404.09216.pdf,DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,"Existing open-vocabulary object detectors typically require a predefined set +of categories from users, significantly confining their application scenarios. +In this paper, we introduce DetCLIPv3, a high-performing detector that excels +not only at both open-vocabulary object detection, but also generating +hierarchical labels for detected objects. DetCLIPv3 is characterized by three +core designs: 1. 
Versatile model architecture: we derive a robust open-set +detection framework which is further empowered with generation ability via the +integration of a caption head. 2. High information density data: we develop an +auto-annotation pipeline leveraging visual large language model to refine +captions for large-scale image-text pairs, providing rich, multi-granular +object labels to enhance the training. 3. Efficient training strategy: we +employ a pre-training stage with low-resolution inputs that enables the object +captioner to efficiently learn a broad spectrum of visual concepts from +extensive image-text paired data. This is followed by a fine-tuning stage that +leverages a small number of high-resolution samples to further enhance +detection performance. With these effective designs, DetCLIPv3 demonstrates +superior open-vocabulary detection performance, \eg, our Swin-T backbone model +achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, +outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, +respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense +captioning task on VG dataset, showcasing its strong generative capability.",cs.CV,['cs.CV'] +3D Feature Tracking via Event Camera,Siqi Li · Zhou Zhikuan · Zhou Xue · Yipeng Li · Shaoyi Du · Yue Gao, ,https://cvpr.thecvf.com/Conferences/2023/AuthorQAEventCameras,,,,,,nan +Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset,Yiming Li · Zhiheng Li · Nuo Chen · Moonjun Gong · Zonglin Lyu · Zehong Wang · Peili Jiang · Chen Feng, ,https://ar5iv.labs.arxiv.org/html/2202.08449,,2202.08449.pdf,V2X-Sim: Multi-Agent Collaborative Perception Dataset and Benchmark for Autonomous Driving,"Vehicle-to-everything (V2X) communication techniques enable the collaboration +between vehicles and many other entities in the neighboring environment, which +could fundamentally improve the perception system for autonomous driving. +However, the lack of a public dataset significantly restricts the research +progress of collaborative perception. To fill this gap, we present V2X-Sim, a +comprehensive simulated multi-agent perception dataset for V2X-aided autonomous +driving. V2X-Sim provides: (1) \hl{multi-agent} sensor recordings from the +road-side unit (RSU) and multiple vehicles that enable collaborative +perception, (2) multi-modality sensor streams that facilitate multi-modality +perception, and (3) diverse ground truths that support various perception +tasks. Meanwhile, we build an open-source testbed and provide a benchmark for +the state-of-the-art collaborative perception algorithms on three tasks, +including detection, tracking and segmentation. V2X-Sim seeks to stimulate +collaborative perception research for autonomous driving before realistic +datasets become widely available. Our dataset and code are available at +\url{https://ai4ce.github.io/V2X-Sim/}.",cs.CV,['cs.CV'] +Taming Stable Diffusion for Text to 360$^{\circ}$ Panorama Image Generation,Cheng Zhang · Qianyi Wu · Camilo Cruz Gambardella · Xiaoshui Huang · Dinh Phung · Wanli Ouyang · Jianfei Cai, ,https://arxiv.org/abs/2404.07949,,2404.07949.pdf,Taming Stable Diffusion for Text to 360° Panorama Image Generation,"Generative models, e.g., Stable Diffusion, have enabled the creation of +photorealistic images from text prompts. Yet, the generation of 360-degree +panorama images from text remains a challenge, particularly due to the dearth +of paired text-panorama data and the domain gap between panorama and +perspective images. 
In this paper, we introduce a novel dual-branch diffusion +model named PanFusion to generate a 360-degree image from a text prompt. We +leverage the stable diffusion model as one branch to provide prior knowledge in +natural image generation and register it to another panorama branch for +holistic image generation. We propose a unique cross-attention mechanism with +projection awareness to minimize distortion during the collaborative denoising +process. Our experiments validate that PanFusion surpasses existing methods +and, thanks to its dual-branch structure, can integrate additional constraints +like room layout for customized panorama outputs. Code is available at +https://chengzhag.github.io/publication/panfusion.",cs.CV,['cs.CV'] +Frequency-aware Event-based Video Deblurring for Real-World Motion Blur,Taewoo Kim · Hoonhee Cho · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.12168,,,Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,"As recent advances in mobile camera technology have enabled the capability to +capture high-resolution images, such as 4K images, the demand for an efficient +deblurring model handling large motion has increased. In this paper, we +discover that the image residual errors, i.e., blur-sharp pixel differences, +can be grouped into some categories according to their motion blur type and how +complex their neighboring pixels are. Inspired by this, we decompose the +deblurring (regression) task into blur pixel discretization (pixel-level blur +classification) and discrete-to-continuous conversion (regression with blur +class map) tasks. Specifically, we generate the discretized image residual +errors by identifying the blur pixels and then transform them to a continuous +form, which is computationally more efficient than naively solving the original +regression problem with continuous values. Here, we found that the +discretization result, i.e., blur segmentation map, remarkably exhibits visual +similarity with the image residual errors. As a result, our efficient model +shows comparable performance to state-of-the-art methods in realistic +benchmarks, while our method is up to 10 times computationally more efficient.",cs.CV,"['cs.CV', 'cs.AI']" +Snapshot Lidar: Fourier embedding of amplitude and phase for single-image depth reconstruction,Sarah Friday · Yunzi Shi · Yaswanth Kumar Cherivirala · Vishwanath Saragadam · Adithya Pediredla, ,https://arxiv.org/abs/2311.10950,,2311.10950.pdf,Single-shot Phase Retrieval from a Fractional Fourier Transform Perspective,"The realm of classical phase retrieval concerns itself with the arduous task +of recovering a signal from its Fourier magnitude measurements, which are +fraught with inherent ambiguities. A single-exposure intensity measurement is +commonly deemed insufficient for the reconstruction of the primal signal, given +that the absent phase component is imperative for the inverse transformation. +In this work, we present a novel single-shot phase retrieval paradigm from a +fractional Fourier transform (FrFT) perspective, which involves integrating the +FrFT-based physical measurement model within a self-supervised reconstruction +scheme. Specifically, the proposed FrFT-based measurement model addresses the +aliasing artifacts problem in the numerical calculation of Fresnel diffraction, +featuring adaptability to both short-distance and long-distance propagation +scenarios. 
Moreover, the intensity measurement in the FrFT domain proves highly +effective in alleviating the ambiguities of phase retrieval and relaxing the +previous conditions on oversampled or multiple measurements in the Fourier +domain. Furthermore, the proposed self-supervised reconstruction approach +harnesses the fast discrete algorithm of FrFT alongside untrained neural +network priors, thereby attaining preeminent results. Through numerical +simulations, we demonstrate that both amplitude and phase objects can be +effectively retrieved from a single-shot intensity measurement using the +proposed approach and provide a promising technique for support-free coherent +diffraction imaging.",cs.CV,"['cs.CV', 'physics.optics']" +ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing,Zhongze Wang · Haitao Zhao · Jingchao Peng · Lujian Yao · Kaijie Zhao, ,https://arxiv.org/abs/2404.17825,,2404.17825.pdf,ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing,"Unpaired image dehazing (UID) holds significant research importance due to +the challenges in acquiring haze/clear image pairs with identical backgrounds. +This paper proposes a novel method for UID named Orthogonal Decoupling +Contrastive Regularization (ODCR). Our method is grounded in the assumption +that an image consists of both haze-related features, which influence the +degree of haze, and haze-unrelated features, such as texture and semantic +information. ODCR aims to ensure that the haze-related features of the dehazing +result closely resemble those of the clear image, while the haze-unrelated +features align with the input hazy image. To accomplish the motivation, +Orthogonal MLPs optimized geometrically on the Stiefel manifold are proposed, +which can project image features into an orthogonal space, thereby reducing the +relevance between different features. Furthermore, a task-driven Depth-wise +Feature Classifier (DWFC) is proposed, which assigns weights to the orthogonal +features based on the contribution of each channel's feature in predicting +whether the feature source is hazy or clear in a self-supervised fashion. +Finally, a Weighted PatchNCE (WPNCE) loss is introduced to achieve the pulling +of haze-related features in the output image toward those of clear images, +while bringing haze-unrelated features close to those of the hazy input. +Extensive experiments demonstrate the superior performance of our ODCR method +on UID.",cs.CV,['cs.CV'] +MaxQ: Multi-Axis Query for N:M Sparsity Network,Jingyang Xiang · Siqi Li · Junhao Chen · Zhuangzhi Chen · Tianxin Huang · Linpeng Peng · Yong Liu,https://github.com/JingyangXiang/MaxQ,https://arxiv.org/abs/2312.07061,,2312.07061.pdf,MaxQ: Multi-Axis Query for N:M Sparsity Network,"N:M sparsity has received increasing attention due to its remarkable +performance and latency trade-off compared with structured and unstructured +sparsity. However, existing N:M sparsity methods do not differentiate the +relative importance of weights among blocks and leave important weights +underappreciated. Besides, they directly apply N:M sparsity to the whole +network, which will cause severe information loss. Thus, they are still +sub-optimal. In this paper, we propose an efficient and effective Multi-Axis +Query methodology, dubbed as MaxQ, to rectify these problems. During the +training, MaxQ employs a dynamic approach to generate soft N:M masks, +considering the weight importance across multiple axes. 
This method enhances +the weights with more importance and ensures more effective updates. Meanwhile, +a sparsity strategy that gradually increases the percentage of N:M weight +blocks is applied, which allows the network to heal from the pruning-induced +damage progressively. During the runtime, the N:M soft masks can be precomputed +as constants and folded into weights without causing any distortion to the +sparse pattern and incurring additional computational overhead. Comprehensive +experiments demonstrate that MaxQ achieves consistent improvements across +diverse CNN architectures in various computer vision tasks, including image +classification, object detection and instance segmentation. For ResNet50 with +1:16 sparse pattern, MaxQ can achieve 74.6\% top-1 accuracy on ImageNet and +improve by over 2.8\% over the state-of-the-art. Codes and checkpoints are +available at \url{https://github.com/JingyangXiang/MaxQ}.",cs.CV,['cs.CV'] +Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework,Vu Minh Hieu Phan · Yutong Xie · Yuankai Qi · Lingqiao Liu · Liyang Liu · Bowen Zhang · Zhibin Liao · Qi Wu · Minh-Son To · Johan Verjans, ,https://arxiv.org/abs/2403.07636v2,,2403.07636v2.pdf,Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework,"Medical vision language pre-training (VLP) has emerged as a frontier of +research, enabling zero-shot pathological recognition by comparing the query +image with the textual descriptions for each disease. Due to the complex +semantics of biomedical texts, current methods struggle to align medical images +with key pathological findings in unstructured reports. This leads to the +misalignment with the target disease's textual representation. In this paper, +we introduce a novel VLP framework designed to dissect disease descriptions +into their fundamental aspects, leveraging prior knowledge about the visual +manifestations of pathologies. This is achieved by consulting a large language +model and medical experts. Integrating a Transformer module, our approach +aligns an input image with the diverse elements of a disease, generating +aspect-centric image representations. By consolidating the matches from each +aspect, we improve the compatibility between an image and its associated +disease. Additionally, capitalizing on the aspect-oriented representations, we +present a dual-head Transformer tailored to process known and unknown diseases, +optimizing the comprehensive detection efficacy. Conducting experiments on +seven downstream datasets, ours improves the accuracy of recent methods by up +to 8.56% and 17.0% for seen and unseen categories, respectively. Our code is +released at https://github.com/HieuPhan33/MAVL.",cs.CV,['cs.CV'] +EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation,Chanyoung Kim · Woojung Han · Dayun Ju · Seong Jae Hwang,https://micv-yonsei.github.io/eagle2024/,https://arxiv.org/abs/2403.01482,,2403.01482.pdf,EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation,"Semantic segmentation has innately relied on extensive pixel-level annotated +data, leading to the emergence of unsupervised methodologies. Among them, +leveraging self-supervised Vision Transformers for unsupervised semantic +segmentation (USS) has been making steady progress with expressive deep +features. 
Yet, for semantically segmenting images with complex objects, a +predominant challenge remains: the lack of explicit object-level semantic +encoding in patch-level features. This technical limitation often leads to +inadequate segmentation of complex objects with diverse structures. To address +this gap, we present a novel approach, EAGLE, which emphasizes object-centric +representation learning for unsupervised semantic segmentation. Specifically, +we introduce EiCue, a spectral technique providing semantic and structural cues +through an eigenbasis derived from the semantic similarity matrix of deep image +features and color affinity from an image. Further, by incorporating our +object-centric contrastive loss with EiCue, we guide our model to learn +object-level representations with intra- and inter-image object-feature +consistency, thereby enhancing semantic accuracy. Extensive experiments on +COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate the state-of-the-art +USS results of EAGLE with accurate and consistent semantic segmentation across +complex scenes.",cs.CV,['cs.CV'] +StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On,Jeongho Kim · Gyojung Gu · Minho Park · Sunghyun Park · Jaegul Choo,https://rlawjdghek.github.io/StableVITON/,https://arxiv.org/abs/2312.01725,,2312.01725.pdf,StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On,"Given a clothing image and a person image, an image-based virtual try-on aims +to generate a customized image that appears natural and accurately reflects the +characteristics of the clothing image. In this work, we aim to expand the +applicability of the pre-trained diffusion model so that it can be utilized +independently for the virtual try-on task.The main challenge is to preserve the +clothing details while effectively utilizing the robust generative capability +of the pre-trained model. In order to tackle these issues, we propose +StableVITON, learning the semantic correspondence between the clothing and the +human body within the latent space of the pre-trained diffusion model in an +end-to-end manner. Our proposed zero cross-attention blocks not only preserve +the clothing details by learning the semantic correspondence but also generate +high-fidelity images by utilizing the inherent knowledge of the pre-trained +model in the warping process. Through our proposed novel attention total +variation loss and applying augmentation, we achieve the sharp attention map, +resulting in a more precise representation of clothing details. StableVITON +outperforms the baselines in qualitative and quantitative evaluation, showing +promising quality in arbitrary person images. Our code is available at +https://github.com/rlawjdghek/StableVITON.",cs.CV,['cs.CV'] +Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions,Yujeong Chae · Hyeonseong Kim · Kuk-Jin Yoon, ,https://arxiv.org/abs/2310.00944,,2310.00944.pdf,Towards Robust 3D Object Detection In Rainy Conditions,"LiDAR sensors are used in autonomous driving applications to accurately +perceive the environment. However, they are affected by adverse weather +conditions such as snow, fog, and rain. These everyday phenomena introduce +unwanted noise into the measurements, severely degrading the performance of +LiDAR-based perception systems. In this work, we propose a framework for +improving the robustness of LiDAR-based 3D object detectors against road spray. 
+Our approach uses a state-of-the-art adverse weather detection network to +filter out spray from the LiDAR point cloud, which is then used as input for +the object detector. In this way, the detected objects are less affected by the +adverse weather in the scene, resulting in a more accurate perception of the +environment. In addition to adverse weather filtering, we explore the use of +radar targets to further filter false positive detections. Tests on real-world +data show that our approach improves the robustness to road spray of several +popular 3D object detectors.",cs.CV,"['cs.CV', 'cs.LG']" +ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks,Andrea Rosasco · Stefano Berti · Giulia Pasquale · Damiano Malafronte · Shogo Sato · Hiroyuki Segawa · Tetsugo Inada · Lorenzo Natale, ,,https://paperswithcode.com/paper/open-ended-vqa-benchmarking-of-vision,,,,,nan +Honeybee: Locality-enhanced Projector for Multimodal LLM,Junbum Cha · Woo-Young Kang · Jonghwan Mun · Byungseok Roh, ,https://arxiv.org/abs/2312.06742,,2312.06742.pdf,Honeybee: Locality-enhanced Projector for Multimodal LLM,"In Multimodal Large Language Models (MLLMs), a visual projector plays a +crucial role in bridging pre-trained vision encoders with LLMs, enabling +profound visual understanding while harnessing the LLMs' robust capabilities. +Despite the importance of the visual projector, it has been relatively less +explored. In this study, we first identify two essential projector properties: +(i) flexibility in managing the number of visual tokens, crucial for MLLMs' +overall efficiency, and (ii) preservation of local context from visual +features, vital for spatial understanding. Based on these findings, we propose +a novel projector design that is both flexible and locality-enhanced, +effectively satisfying the two desirable properties. Additionally, we present +comprehensive strategies to effectively utilize multiple and multifaceted +instruction datasets. Through extensive experiments, we examine the impact of +individual design choices. Finally, our proposed MLLM, Honeybee, remarkably +outperforms previous state-of-the-art methods across various benchmarks, +including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly +higher efficiency. Code and models are available at +https://github.com/kakaobrain/honeybee.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search,Junghyup Lee · Bumsub Ham, ,https://arxiv.org/abs/2403.19232,,2403.19232.pdf,AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search,"Training-free network architecture search (NAS) aims to discover +high-performing networks with zero-cost proxies, capturing network +characteristics related to the final performance. However, network rankings +estimated by previous training-free NAS methods have shown weak correlations +with the performance. To address this issue, we propose AZ-NAS, a novel +approach that leverages the ensemble of various zero-cost proxies to enhance +the correlation between a predicted ranking of networks and the ground truth +substantially in terms of the performance. To achieve this, we introduce four +novel zero-cost proxies that are complementary to each other, analyzing +distinct traits of architectures in the views of expressivity, progressivity, +trainability, and complexity. The proxy scores can be obtained simultaneously +within a single forward and backward pass, making an overall NAS process highly +efficient. 
In order to integrate the rankings predicted by our proxies +effectively, we introduce a non-linear ranking aggregation method that +highlights the networks highly-ranked consistently across all the proxies. +Experimental results conclusively demonstrate the efficacy and efficiency of +AZ-NAS, outperforming state-of-the-art methods on standard benchmarks, all +while maintaining a reasonable runtime cost.",cs.CV,"['cs.CV', 'cs.LG']" +Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching,Xianqi Wang · Gangwei Xu · Hao Jia · Xin Yang,https://github.com/Windsrain/Selective-Stereo,https://arxiv.org/abs/2403.00486,,2403.00486.pdf,Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching,"Stereo matching methods based on iterative optimization, like RAFT-Stereo and +IGEV-Stereo, have evolved into a cornerstone in the field of stereo matching. +However, these methods struggle to simultaneously capture high-frequency +information in edges and low-frequency information in smooth regions due to the +fixed receptive field. As a result, they tend to lose details, blur edges, and +produce false matches in textureless areas. In this paper, we propose Selective +Recurrent Unit (SRU), a novel iterative update operator for stereo matching. +The SRU module can adaptively fuse hidden disparity information at multiple +frequencies for edge and smooth regions. To perform adaptive fusion, we +introduce a new Contextual Spatial Attention (CSA) module to generate attention +maps as fusion weights. The SRU empowers the network to aggregate hidden +disparity information across multiple frequencies, mitigating the risk of vital +hidden disparity information loss during iterative processes. To verify SRU's +universality, we apply it to representative iterative stereo matching methods, +collectively referred to as Selective-Stereo. Our Selective-Stereo ranks +$1^{st}$ on KITTI 2012, KITTI 2015, ETH3D, and Middlebury leaderboards among +all published methods. Code is available at +https://github.com/Windsrain/Selective-Stereo.",cs.CV,['cs.CV'] +Learning the 3D Fauna of the Web,Zizhang Li · Dor Litvak · Ruining Li · Yunzhi Zhang · Tomas Jakab · Christian Rupprecht · Shangzhe Wu · Andrea Vedaldi · Jiajun Wu, ,https://arxiv.org/abs/2401.02400,,2401.02400.pdf,Learning the 3D Fauna of the Web,"Learning 3D models of all animals on the Earth requires massively scaling up +existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an +approach that learns a pan-category deformable 3D animal model for more than +100 animal species jointly. One crucial bottleneck of modeling animals is the +limited availability of training data, which we overcome by simply learning +from 2D Internet images. We show that prior category-specific attempts fail to +generalize to rare species with limited training images. We address this +challenge by introducing the Semantic Bank of Skinned Models (SBSM), which +automatically discovers a small set of base animal shapes by combining +geometric inductive priors with semantic knowledge implicitly captured by an +off-the-shelf self-supervised feature extractor. To train such a model, we also +contribute a new large-scale dataset of diverse animal species. 
At inference +time, given a single image of any quadruped animal, our model reconstructs an +articulated 3D mesh in a feed-forward fashion within seconds.",cs.CV,['cs.CV'] +LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking,Jialin Li · Qiang Nie · Weifu Fu · Yuhuan Lin · Guangpin Tao · Yong Liu · Chengjie Wang, ,https://arxiv.org/abs/2403.04303,,2403.04303.pdf,LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking,"Deep learning models, particularly those based on transformers, often employ +numerous stacked structures, which possess identical architectures and perform +similar functions. While effective, this stacking paradigm leads to a +substantial increase in the number of parameters, posing challenges for +practical applications. In today's landscape of increasingly large models, +stacking depth can even reach dozens, further exacerbating this issue. To +mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS +allows stacked modules to share the majority of parameters, requiring a much +smaller number of unique ones per module to match or even surpass the +performance of using entirely distinct ones, thereby significantly reducing +parameter usage. We validate our method by applying it to the stacked decoders +of a query-based object detector, and conduct extensive experiments on the +widely used MS COCO dataset. Experimental results demonstrate the effectiveness +of our method, as even with a 70\% reduction in the parameters of the decoder, +our method still enables the model to achieve comparable or",cs.CV,['cs.CV'] +VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation,Yang Chen · Yingwei Pan · haibo yang · Ting Yao · Tao Mei,https://vp3d-cvpr24.github.io/,https://arxiv.org/abs/2403.17001,,2403.17001.pdf,VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation,"Recent innovations on text-to-3D generation have featured Score Distillation +Sampling (SDS), which enables the zero-shot learning of implicit 3D models +(NeRF) by directly distilling prior knowledge from 2D diffusion models. +However, current SDS-based models still struggle with intricate text prompts +and commonly result in distorted 3D models with unrealistic textures or +cross-view inconsistency issues. In this work, we introduce a novel Visual +Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the +visual appearance knowledge in 2D visual prompt to boost text-to-3D generation. +Instead of solely supervising SDS with text prompt, VP3D first capitalizes on +2D diffusion model to generate a high-quality image from input text, which +subsequently acts as visual prompt to strengthen SDS optimization with explicit +visual appearance. Meanwhile, we couple the SDS optimization with additional +differentiable reward function that encourages rendering images of 3D models to +better visually align with 2D visual prompt and semantically match with text +prompt. Through extensive experiments, we show that the 2D Visual Prompt in our +VP3D significantly eases the learning of visual appearance of 3D models and +thus leads to higher visual fidelity with more detailed textures. It is also +appealing in view that when replacing the self-generating visual prompt with a +given reference image, VP3D is able to trigger a new task of stylized +text-to-3D generation. 
Our project page is available at +https://vp3d-cvpr24.github.io.",cs.CV,"['cs.CV', 'cs.MM']" +Vlogger: Make Your Dream A Vlog,Shaobin Zhuang · Kunchang Li · Xinyuan Chen · Yaohui Wang · Ziwei Liu · Yu Qiao · Yali Wang,https://github.com/zhuangshaobin/Vlogger,https://arxiv.org/abs/2401.09414,,2401.09414.pdf,Vlogger: Make Your Dream A Vlog,"In this work, we present Vlogger, a generic AI system for generating a +minute-level video blog (i.e., vlog) of user descriptions. Different from short +videos with a few seconds, vlog often contains a complex storyline with +diversified scenes, which is challenging for most existing video generation +approaches. To break through this bottleneck, our Vlogger smartly leverages +Large Language Model (LLM) as Director and decomposes a long video generation +task of vlog into four key stages, where we invoke various foundation models to +play the critical roles of vlog professionals, including (1) Script, (2) Actor, +(3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, +our Vlogger can generate vlogs through explainable cooperation of top-down +planning and bottom-up shooting. Moreover, we introduce a novel video diffusion +model, ShowMaker, which serves as a videographer in our Vlogger for generating +the video snippet of each shooting scene. By incorporating Script and Actor +attentively as textual and visual prompts, it can effectively enhance +spatial-temporal coherence in the snippet. Besides, we design a concise mixed +training paradigm for ShowMaker, boosting its capacity for both T2V generation +and prediction. Finally, the extensive experiments show that our method +achieves state-of-the-art performance on zero-shot T2V generation and +prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs +from open-world descriptions, without loss of video coherence on script and +actor. The code and model is all available at +https://github.com/zhuangshaobin/Vlogger.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation,Ruida Zhang · Chenyangguang Zhang · Yan Di · Fabian Manhardt · Xingyu Liu · Federico Tombari · Xiangyang Ji, ,https://arxiv.org/abs/2403.10099,,2403.10099.pdf,KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation,"In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and +Deformation framework that takes object scans as input and jointly retrieves +and deforms the most geometrically similar CAD models from a pre-processed +database to tightly match the target. Unlike existing dense matching based +methods that typically struggle with noisy partial scans, we propose to +leverage category-consistent sparse keypoints to naturally handle both full and +partial object scans. Specifically, we first employ a lightweight retrieval +module to establish a keypoint-based embedding space, measuring the similarity +among objects by dynamically aggregating deformation-aware local-global +features around extracted keypoints. Objects that are close in the embedding +space are considered similar in geometry. Then we introduce the neural +cage-based deformation module that estimates the influence vector of each +keypoint upon cage vertices inside its local support region to control the +deformation of the retrieved shape. Extensive experiments on the synthetic +dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED +surpasses existing state-of-the-art approaches by a large margin. 
Codes and +trained models will be released in https://github.com/lolrudy/KP-RED.",cs.CV,['cs.CV'] +AssistGUI: Task-Oriented PC Graphical User Interface Automation,Difei Gao · Lei Ji · Zechen Bai · Mingyu Ouyang · Peiran Li · Dongxing Mao · Qin WU · Weichen Zhang · Peiyi Wang · Xiangwu Guo · Hengxu Wang · Luowei Zhou · Mike Zheng Shou, ,https://arxiv.org/abs/2312.13108,,2312.13108.pdf,ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation,"Graphical User Interface (GUI) automation holds significant promise for +assisting users with complex tasks, thereby boosting human productivity. +Existing works leveraging Large Language Model (LLM) or LLM-based AI agents +have shown capabilities in automating tasks on Android and Web platforms. +However, these tasks are primarily aimed at simple device usage and +entertainment operations. This paper presents a novel benchmark, AssistGUI, to +evaluate whether models are capable of manipulating the mouse and keyboard on +the Windows platform in response to user-requested tasks. We carefully +collected a set of 100 tasks from nine widely-used software applications, such +as, After Effects and MS Word, each accompanied by the necessary project files +for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied +Agent framework, which incorporates a sophisticated GUI parser driven by an +LLM-agent and an enhanced reasoning mechanism adept at handling lengthy +procedural tasks. Our experimental results reveal that our GUI Parser and +Reasoning mechanism outshine existing methods in performance. Nevertheless, the +potential remains substantial, with the best model attaining only a 46% success +rate on our benchmark. We conclude with a thorough analysis of the current +methods' limitations, setting the stage for future breakthroughs in this +domain.",cs.CV,['cs.CV'] +MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision,Chenyangguang Zhang · Guanlong Jiao · Yan Di · Gu Wang · Ziqin Huang · Ruida Zhang · Fabian Manhardt · Bowen Fu · Federico Tombari · Xiangyang Ji, ,https://arxiv.org/abs/2310.11696,,2310.11696.pdf,MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision,"Previous works concerning single-view hand-held object reconstruction +typically rely on supervision from 3D ground-truth models, which are hard to +collect in real world. In contrast, readily accessible hand-object videos offer +a promising training data source, but they only give heavily occluded object +observations. In this paper, we present a novel synthetic-to-real framework to +exploit Multi-view Occlusion-aware supervision from hand-object videos for +Hand-held Object reconstruction (MOHO) from a single image, tackling two +predominant challenges in such setting: hand-induced occlusion and object's +self-occlusion. First, in the synthetic pre-training stage, we render a +large-scaled synthetic dataset SOMVideo with hand-object images and multi-view +occlusion-free supervisions, adopted to address hand-induced occlusion in both +2D and 3D spaces. Second, in the real-world finetuning stage, MOHO leverages +the amodal-mask-weighted geometric supervision to mitigate the unfaithful +guidance caused by the hand-occluded supervising views in real world. Moreover, +domain-consistent occlusion-aware features are amalgamated in MOHO to resist +object's self-occlusion for inferring the complete object shape. 
Extensive +experiments on HO3D and DexYCB datasets demonstrate 2D-supervised MOHO gains +superior results against 3D-supervised methods by a large margin.",cs.CV,['cs.CV'] +Text-Guided 3D Face Synthesis - From Generation to Editing,Yunjie Wu · Yapeng Meng · Zhipeng Hu · Lincheng Li · Haoqian Wu · Kun Zhou · Weiwei Xu · Xin Yu, ,https://arxiv.org/abs/2312.00375,,2312.00375.pdf,Text-Guided 3D Face Synthesis -- From Generation to Editing,"Text-guided 3D face synthesis has achieved remarkable results by leveraging +text-to-image (T2I) diffusion models. However, most existing works focus solely +on the direct generation, ignoring the editing, restricting them from +synthesizing customized 3D faces through iterative adjustments. In this paper, +we propose a unified text-guided framework from face generation to editing. In +the generation stage, we propose a geometry-texture decoupled generation to +mitigate the loss of geometric details caused by coupling. Besides, decoupling +enables us to utilize the generated geometry as a condition for texture +generation, yielding highly geometry-texture aligned results. We further employ +a fine-tuned texture diffusion model to enhance texture quality in both RGB and +YUV space. In the editing stage, we first employ a pre-trained diffusion model +to update facial geometry or texture based on the texts. To enable sequential +editing, we introduce a UV domain consistency preservation regularization, +preventing unintentional changes to irrelevant facial attributes. Besides, we +propose a self-guided consistency weight strategy to improve editing efficacy +while preserving consistency. Through comprehensive experiments, we showcase +our method's superiority in face synthesis. Project page: +https://faceg2e.github.io/.",cs.CV,['cs.CV'] +Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval,Zhen-Duo Chen · Li-Jun Zhao · Zi-Chao Zhang · Xin Luo · Xin-Shun Xu, ,https://arxiv.org/abs/2311.06067,,2311.06067.pdf,Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval,"In recent years, hashing methods have been popular in the large-scale media +search for low storage and strong representation capabilities. To describe +objects with similar overall appearance but subtle differences, more and more +studies focus on hashing-based fine-grained image retrieval. Existing hashing +networks usually generate both local and global features through attention +guidance on the same deep activation tensor, which limits the diversity of +feature representations. To handle this limitation, we substitute convolutional +descriptors for attention-guided features and propose an Attributes Grouping +and Mining Hashing (AGMH), which groups and embeds the category-specific visual +attributes in multiple descriptors to generate a comprehensive feature +representation for efficient fine-grained image retrieval. Specifically, an +Attention Dispersion Loss (ADL) is designed to force the descriptors to attend +to various local regions and capture diverse subtle details. Moreover, we +propose a Stepwise Interactive External Attention (SIEA) to mine critical +attributes in each descriptor and construct correlations between fine-grained +attributes and objects. The attention mechanism is dedicated to learning +discrete attributes, which will not cost additional computations in hash codes +generation. Finally, the compact binary codes are learned by preserving +pairwise similarities. 
Experimental results demonstrate that AGMH consistently +yields the best performance against state-of-the-art methods on fine-grained +benchmark datasets.",cs.IR,"['cs.IR', 'cs.AI', 'cs.CV']" +VOODOO 3D: VOlumetric pOrtrait Disentanglement fOr Online 3D head reenactment,Phong Tran · Egor Zakharov · Long Nhat Ho · Anh Tran · Liwen Hu · Hao Li, ,https://arxiv.org/abs/2312.04651,,2312.04651.pdf,VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment,"We present a 3D-aware one-shot head reenactment method based on a fully +volumetric neural disentanglement framework for source appearance and driver +expressions. Our method is real-time and produces high-fidelity and +view-consistent output, suitable for 3D teleconferencing systems based on +holographic displays. Existing cutting-edge 3D-aware reenactment methods often +use neural radiance fields or 3D meshes to produce view-consistent appearance +encoding, but, at the same time, they rely on linear face models, such as 3DMM, +to achieve its disentanglement with facial expressions. As a result, their +reenactment results often exhibit identity leakage from the driver or have +unnatural expressions. To address these problems, we propose a neural +self-supervised disentanglement approach that lifts both the source image and +driver video frame into a shared 3D volumetric representation based on +tri-planes. This representation can then be freely manipulated with expression +tri-planes extracted from the driving images and rendered from an arbitrary +view using neural radiance fields. We achieve this disentanglement via +self-supervised learning on a large in-the-wild video dataset. We further +introduce a highly effective fine-tuning approach to improve the +generalizability of the 3D lifting using the same real-world data. We +demonstrate state-of-the-art performance on a wide range of datasets, and also +showcase high-quality 3D-aware head reenactment on highly challenging and +diverse subjects, including non-frontal head poses and complex expressions for +both source and driver.",cs.CV,['cs.CV'] +Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence,Junyi Zhang · Charles Herrmann · Junhwa Hur · Eric Chen · Varun Jampani · Deqing Sun · Ming-Hsuan Yang,telling-left-from-right.github.io,https://arxiv.org/abs/2311.17034,,2311.17034.pdf,Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence,"While pre-trained large-scale vision models have shown significant promise +for semantic correspondence, their features often struggle to grasp the +geometry and orientation of instances. This paper identifies the importance of +being geometry-aware for semantic correspondence and reveals a limitation of +the features of current foundation models under simple post-processing. We show +that incorporating this information can markedly enhance semantic +correspondence performance with simple but effective solutions in both +zero-shot and supervised settings. We also construct a new challenging +benchmark for semantic correspondence built from an existing animal pose +estimation dataset, for both pre-training validating models. Our method +achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the +challenging SPair-71k dataset, outperforming the state of the art by 5.5p and +11.0p absolute gains, respectively. 
Our code and datasets are publicly +available at: https://telling-left-from-right.github.io/.",cs.CV,['cs.CV'] +Federated Generalized Category Discovery,Nan Pu · Wenjing Li · Xinyuan Ji · Yalan Qin · Nicu Sebe · Zhun Zhong, ,https://arxiv.org/abs/2403.07369,,2403.07369.pdf,Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery,"In this paper, we study the problem of Generalized Category Discovery (GCD), +which aims to cluster unlabeled data from both known and unknown categories +using the knowledge of labeled data from known categories. Current GCD methods +rely on only visual cues, which however neglect the multi-modality perceptive +nature of human cognitive processes in discovering novel visual categories. To +address this, we propose a two-phase TextGCD framework to accomplish +multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD +mainly includes a retrieval-based text generation (RTG) phase and a +cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon +using category tags from diverse datasets and attributes from Large Language +Models, generating descriptive texts for images in a retrieval manner. Second, +CCT leverages disparities between textual and visual modalities to foster +mutual learning, thereby enhancing visual GCD. In addition, we design an +adaptive class aligning strategy to ensure the alignment of category +perceptions between modalities as well as a soft-voting mechanism to integrate +multi-modality cues. Experiments on eight datasets show the large superiority +of our approach over state-of-the-art methods. Notably, our approach +outperforms the best competitor, by 7.7% and 10.8% in All accuracy on +ImageNet-1k and CUB, respectively.",cs.CV,['cs.CV'] +LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes,Shanlin Sun · Bingbing Zhuang · Ziyu Jiang · Buyu Liu · Xiaohui Xie · Manmohan Chandraker, ,https://arxiv.org/abs/2405.00900,,2405.00900.pdf,LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes,"Photorealistic simulation plays a crucial role in applications such as +autonomous driving, where advances in neural radiance fields (NeRFs) may allow +better scalability through the automatic creation of digital 3D assets. +However, reconstruction quality suffers on street scenes due to largely +collinear camera motions and sparser samplings at higher speeds. On the other +hand, the application often demands rendering from camera views that deviate +from the inputs to accurately simulate behaviors like lane changes. In this +paper, we propose several insights that allow a better utilization of Lidar +data to improve NeRF quality on street scenes. First, our framework learns a +geometric scene representation from Lidar, which is fused with the implicit +grid-based representation for radiance decoding, thereby supplying stronger +geometric information offered by explicit point cloud. Second, we put forth a +robust occlusion-aware depth supervision scheme, which allows utilizing +densified Lidar points by accumulation. Third, we generate augmented training +views from Lidar points for further improvement. 
Our insights translate to +largely improved novel view synthesis under real driving scenes.",cs.CV,['cs.CV'] +Learning Occupancy for Monocular 3D Object Detection,Liang Peng · Junkai Xu · Haoran Cheng · Zheng Yang · Xiaopei Wu · Wei Qian · Wenxiao Wang · Boxi Wu · Deng Cai, ,https://arxiv.org/abs/2308.09421,,2308.09421.pdf,MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection,"In the field of monocular 3D detection, it is common practice to utilize +scene geometric clues to enhance the detector's performance. However, many +existing works adopt these clues explicitly such as estimating a depth map and +back-projecting it into 3D space. This explicit methodology induces sparsity in +3D representations due to the increased dimensionality from 2D to 3D, and leads +to substantial information loss, especially for distant and occluded objects. +To alleviate this issue, we propose MonoNeRD, a novel detection framework that +can infer dense 3D geometry and occupancy. Specifically, we model scenes with +Signed Distance Functions (SDF), facilitating the production of dense 3D +representations. We treat these representations as Neural Radiance Fields +(NeRF) and then employ volume rendering to recover RGB images and depth maps. +To the best of our knowledge, this work is the first to introduce volume +rendering for M3D, and demonstrates the potential of implicit reconstruction +for image-based 3D perception. Extensive experiments conducted on the KITTI-3D +benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. +Codes are available at https://github.com/cskkxjk/MonoNeRD.",cs.CV,['cs.CV'] +CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving,Mozhgan Pourkeshavarz · Junrui Zhang · Amir Rasouli, ,https://arxiv.org/abs/2404.12538,,2404.12538.pdf,TrACT: A Training Dynamics Aware Contrastive Learning Framework for Long-tail Trajectory Prediction,"As a safety critical task, autonomous driving requires accurate predictions +of road users' future trajectories for safe motion planning, particularly under +challenging conditions. Yet, many recent deep learning methods suffer from a +degraded performance on the challenging scenarios, mainly because these +scenarios appear less frequently in the training data. To address such a +long-tail issue, existing methods force challenging scenarios closer together +in the feature space during training to trigger information sharing among them +for more robust learning. These methods, however, primarily rely on the motion +patterns to characterize scenarios, omitting more informative contextual +information, such as interactions and scene layout. We argue that exploiting +such information not only improves prediction accuracy but also scene +compliance of the generated trajectories. In this paper, we propose to +incorporate richer training dynamics information into a prototypical +contrastive learning framework. More specifically, we propose a two-stage +process. First, we generate rich contextual features using a baseline +encoder-decoder framework. These features are split into clusters based on the +model's output errors, using the training dynamics information, and a prototype +is computed within each cluster. Second, we retrain the model using the +prototypes in a contrastive learning framework. 
We conduct empirical +evaluations of our approach using two large-scale naturalistic datasets and +show that our method achieves state-of-the-art performance by improving +accuracy and scene compliance on the long-tail samples. Furthermore, we perform +experiments on a subset of the clusters to highlight the additional benefit of +our approach in reducing training bias.",cs.CV,"['cs.CV', 'cs.LG']" +Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings,Yakun Chang · Yeliduosi Xiaokaiti · Yujia Liu · Bin Fan · Zhaojun Huang · Tiejun Huang · Boxin Shi, ,https://arxiv.org/abs/2405.00244,,,Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,"As an important and practical way to obtain high dynamic range (HDR) video, +HDR video reconstruction from sequences with alternating exposures is still +less explored, mainly due to the lack of large-scale real-world datasets. +Existing methods are mostly trained on synthetic datasets, which perform poorly +in real scenes. In this work, to facilitate the development of real-world HDR +video reconstruction, we present Real-HDRV, a large-scale real-world benchmark +dataset for HDR video reconstruction, featuring various scenes, diverse motion +patterns, and high-quality labels. Specifically, our dataset contains 500 +LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, +covering daytime, nighttime, indoor, and outdoor scenes. To our best knowledge, +our dataset is the largest real-world HDR video reconstruction dataset. +Correspondingly, we propose an end-to-end network for HDR video reconstruction, +where a novel two-stage strategy is designed to perform alignment sequentially. +Specifically, the first stage performs global alignment with the adaptively +estimated global offsets, reducing the difficulty of subsequent alignment. The +second stage implicitly performs local alignment in a coarse-to-fine manner at +the feature level using the adaptive separable convolution. Extensive +experiments demonstrate that: (1) models trained on our dataset can achieve +better performance on real scenes than those trained on synthetic datasets; (2) +our method outperforms previous state-of-the-art methods. Our dataset is +available at https://github.com/yungsyu99/Real-HDRV.",cs.CV,['cs.CV'] +Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression,Hancheng Ye · Chong Yu · Peng Ye · Renqiu Xia · Bo Zhang · Yansong Tang · Jiwen Lu · Tao Chen, ,https://arxiv.org/abs/2403.15835,,2403.15835.pdf,Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression,"Recent Vision Transformer Compression (VTC) works mainly follow a two-stage +scheme, where the importance score of each model unit is first evaluated or +preset in each submodule, followed by the sparsity score evaluation according +to the target sparsity constraint. Such a separate evaluation process induces +the gap between importance and sparsity score distributions, thus causing high +search costs for VTC. In this work, for the first time, we investigate how to +integrate the evaluations of importance and sparsity scores into a single +stage, searching the optimal subnets in an efficient manner. Specifically, we +present OFB, a cost-efficient approach that simultaneously evaluates both +importance and sparsity scores, termed Once for Both (OFB), for VTC. 
First, a +bi-mask scheme is developed by entangling the importance score and the +differentiable sparsity score to jointly determine the pruning potential +(prunability) of each unit. Such a bi-mask search strategy is further used +together with a proposed adaptive one-hot loss to realize the +progressive-and-efficient search for the most important subnet. Finally, +Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature +space to be more representative during the search process, which may be +degraded by the dimension reduction. Extensive experiments demonstrate that OFB +can achieve superior compression performance over state-of-the-art +searching-based and pruning-based methods under various Vision Transformer +architectures, meanwhile promoting search efficiency significantly, e.g., +costing one GPU search day for the compression of DeiT-S on ImageNet-1K.",cs.CV,['cs.CV'] +GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians,Liangxiao Hu · Hongwen Zhang · Yuxiang Zhang · Boyao ZHOU · Boning Liu · Shengping Zhang · Liqiang Nie,https://huliangxiao.github.io/GaussianAvatar,https://arxiv.org/abs/2312.02134,,2312.02134.pdf,GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians,"We present GaussianAvatar, an efficient approach to creating realistic human +avatars with dynamic 3D appearances from a single video. We start by +introducing animatable 3D Gaussians to explicitly represent humans in various +poses and clothing styles. Such an explicit and animatable representation can +fuse 3D appearances more efficiently and consistently from 2D observations. Our +representation is further augmented with dynamic properties to support +pose-dependent appearance modeling, where a dynamic appearance network along +with an optimizable feature tensor is designed to learn the +motion-to-appearance mapping. Moreover, by leveraging the differentiable motion +condition, our method enables a joint optimization of motions and appearances +during avatar modeling, which helps to tackle the long-standing issue of +inaccurate motion estimation in monocular settings. The efficacy of +GaussianAvatar is validated on both the public dataset and our collected +dataset, demonstrating its superior performances in terms of appearance quality +and rendering efficiency.",cs.CV,['cs.CV'] +OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning,Lingyi Hong · Shilin Yan · Renrui Zhang · Wanyun Li · Xinyu Zhou · Pinxue Guo · Kaixun Jiang · Yiting Cheng · Jinglun Li · Zhaoyu Chen · Wenqiang Zhang, ,https://arxiv.org/abs/2403.09634,,2403.09634.pdf,OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning,"Visual object tracking aims to localize the target object of each frame based +on its initial appearance in the first frame. Depending on the input modility, +tracking tasks can be divided into RGB tracking and RGB+X (e.g. RGB+N, and +RGB+D) tracking. Despite the different input modalities, the core aspect of +tracking is the temporal matching. Based on this common ground, we present a +general framework to unify various tracking tasks, termed as OneTracker. +OneTracker first performs a large-scale pre-training on a RGB tracker called +Foundation Tracker. This pretraining phase equips the Foundation Tracker with a +stable ability to estimate the location of the target object. 
Then we regard +other modality information as prompt and build Prompt Tracker upon Foundation +Tracker. Through freezing the Foundation Tracker and only adjusting some +additional trainable parameters, Prompt Tracker inhibits the strong +localization ability from Foundation Tracker and achieves parameter-efficient +finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of +our general framework OneTracker, which is consisted of Foundation Tracker and +Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks +across 11 benchmarks and our OneTracker outperforms other models and achieves +state-of-the-art performance.",cs.CV,['cs.CV'] +TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation,Hoonhee Cho · Taewoo Kim · Yuhwan Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.18156,,,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes,Haozhe Lin · Chunyu Wei · Li He · Yuchen Guo · Yuchy Zhao · Shanglong Li · Lu Fang, ,https://arxiv.org/abs/2402.19002,,2402.19002.pdf,GoalNet: Goal Areas Oriented Pedestrian Trajectory Prediction,"Predicting the future trajectories of pedestrians on the road is an important +task for autonomous driving. The pedestrian trajectory prediction is affected +by scene paths, pedestrian's intentions and decision-making, which is a +multi-modal problem. Most recent studies use past trajectories to predict a +variety of potential future trajectory distributions, which do not account for +the scene context and pedestrian targets. Instead of predicting the future +trajectory directly, we propose to use scene context and observed trajectory to +predict the goal points first, and then reuse the goal points to predict the +future trajectories. 
By leveraging the information from scene context and +observed trajectory, the uncertainty can be limited to a few target areas, +which represent the ""goals"" of the pedestrians. In this paper, we propose +GoalNet, a new trajectory prediction neural network based on the goal areas of +a pedestrian. Our network can predict both pedestrian's trajectories and +bounding boxes. The overall model is efficient and modular, and its outputs can +be changed according to the usage scenario. Experimental results show that +GoalNet significantly improves the previous state-of-the-art performance by +48.7% on the JAAD and 40.8% on the PIE dataset.",cs.CV,"['cs.CV', 'cs.AI']" +Discovering Syntactic Interaction Clues for Human-Object Interaction Detection,Jinguo Luo · Weihong Ren · Weibo Jiang · Xi'ai Chen · Qiang Wang · Zhi Han · Honghai LIU, ,,https://www.youtube.com/watch?v=YxKgZAoqzpY,,,,,nan +Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches,Qing Yu · Mikihiro Tanaka · Kent Fujiwara,https://yu1ut.com/MotionPatches-HP/,https://arxiv.org/abs/2405.04771,,2405.04771.pdf,Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches,"To build a cross-modal latent space between 3D human motion and language, +acquiring large-scale and high-quality human motion data is crucial. However, +unlike the abundance of image data, the scarcity of motion data has limited the +performance of existing motion-language models. To counter this, we introduce +""motion patches"", a new representation of motion sequences, and propose using +Vision Transformers (ViT) as motion encoders via transfer learning, aiming to +extract useful knowledge from the image domain and apply it to the motion +domain. These motion patches, created by dividing and sorting skeleton joints +based on body parts in motion sequences, are robust to varying skeleton +structures, and can be regarded as color image patches in ViT. We find that +transfer learning with pre-trained weights of ViT obtained through training +with 2D image data can boost the performance of motion analysis, presenting a +promising direction for addressing the issue of limited motion data. Our +extensive experiments show that the proposed motion patches, used jointly with +ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion +retrieval, and other novel challenging tasks, such as cross-skeleton +recognition, zero-shot motion classification, and human interaction +recognition, which are currently impeded by the lack of data.",cs.CV,['cs.CV'] +ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification,Jiangbo Shi · Chen Li · Tieliang Gong · Yefeng Zheng · Huazhu Fu, ,https://arxiv.org/abs/2312.01099,,2312.01099.pdf,Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Bag-Level Classifier is a Good Instance-Level Teacher,"Multiple Instance Learning (MIL) has demonstrated promise in Whole Slide +Image (WSI) classification. However, a major challenge persists due to the high +computational cost associated with processing these gigapixel images. Existing +methods generally adopt a two-stage approach, comprising a non-learnable +feature embedding stage and a classifier training stage. Though it can greatly +reduce the memory consumption by using a fixed feature embedder pre-trained on +other domains, such scheme also results in a disparity between the two stages, +leading to suboptimal classification accuracy. 
To address this issue, we +propose that a bag-level classifier can be a good instance-level teacher. Based +on this idea, we design Iteratively Coupled Multiple Instance Learning (ICMIL) +to couple the embedder and the bag classifier at a low cost. ICMIL initially +fix the patch embedder to train the bag classifier, followed by fixing the bag +classifier to fine-tune the patch embedder. The refined embedder can then +generate better representations in return, leading to a more accurate +classifier for the next iteration. To realize more flexible and more effective +embedder fine-tuning, we also introduce a teacher-student framework to +efficiently distill the category knowledge in the bag classifier to help the +instance-level embedder fine-tuning. Thorough experiments were conducted on +four distinct datasets to validate the effectiveness of ICMIL. The experimental +results consistently demonstrate that our method significantly improves the +performance of existing MIL backbones, achieving state-of-the-art results. The +code is available at: https://github.com/Dootmaan/ICMIL/tree/confidence_based",cs.CV,['cs.CV'] +Neural Visibility Field for Uncertainty-Driven Active Mapping,Shangjie Xue · Jesse Dill · Pranay Mathur · Frank Dellaert · Panagiotis Tsiotras · Danfei Xu, ,http://export.arxiv.org/abs/2308.16246,,2308.16246.pdf,Active Neural Mapping,"We address the problem of active mapping with a continually-learned neural +scene representation, namely Active Neural Mapping. The key lies in actively +finding the target space to be explored with efficient agent movement, thus +minimizing the map uncertainty on-the-fly within a previously unseen +environment. In this paper, we examine the weight space of the +continually-learned neural field, and show empirically that the neural +variability, the prediction robustness against random weight perturbation, can +be directly utilized to measure the instant uncertainty of the neural map. +Together with the continuous geometric information inherited in the neural map, +the agent can be guided to find a traversable path to gradually gain knowledge +of the environment. We present for the first time an active mapping system with +a coordinate-based implicit neural representation for online scene +reconstruction. Experiments in the visually-realistic Gibson and Matterport3D +environment demonstrate the efficacy of the proposed method.",cs.CV,['cs.CV'] +Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors,Haoxuanye Ji · Pengpeng Liang · Erkang Cheng, ,https://arxiv.org/abs/2403.06093,,2403.06093.pdf,Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors,"Multi-camera-based 3D object detection has made notable progress in the past +several years. However, we observe that there are cases (e.g. faraway regions) +in which popular 2D object detectors are more reliable than state-of-the-art 3D +detectors. In this paper, to improve the performance of query-based 3D object +detectors, we present a novel query generating approach termed QAF2D, which +infers 3D query anchors from 2D detection results. A 2D bounding box of an +object in an image is lifted to a set of 3D anchors by associating each sampled +point within the box with depth, yaw angle, and size candidates. Then, the +validity of each 3D anchor is verified by comparing its projection in the image +with its corresponding 2D box, and only valid anchors are kept and used to +construct queries. 
The class information of the 2D bounding box associated with +each query is also utilized to match the predicted boxes with ground truth for +the set-based loss. The image feature extraction backbone is shared between the +3D detector and 2D detector by adding a small number of prompt parameters. We +integrate QAF2D into three popular query-based 3D object detectors and carry +out comprehensive evaluations on the nuScenes dataset. The largest improvement +that QAF2D can bring about on the nuScenes validation subset is $2.3\%$ NDS and +$2.7\%$ mAP. Code is available at https://github.com/nullmax-vision/QAF2D.",cs.CV,['cs.CV'] +Resolution Limit of Single-Photon LIDAR,Stanley H. Chan · Hashan K Weerasooriya · Weijian Zhang · Pamela Abshire · Istvan Gyongy · Robert Henderson, ,https://arxiv.org/abs/2403.17719,,2403.17719.pdf,Resolution Limit of Single-Photon LiDAR,"Single-photon Light Detection and Ranging (LiDAR) systems are often equipped +with an array of detectors for improved spatial resolution and sensing speed. +However, given a fixed amount of flux produced by the laser transmitter across +the scene, the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more +pixels are packed in a unit space. This presents a fundamental trade-off +between the spatial resolution of the sensor array and the SNR received at each +pixel. Theoretical characterization of this fundamental limit is explored. By +deriving the photon arrival statistics and introducing a series of new +approximation techniques, the Mean Squared Error (MSE) of the +maximum-likelihood estimator of the time delay is derived. The theoretical +predictions align well with simulations and real data.",eess.SP,"['eess.SP', 'cs.CV']" +VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models,Xiang Li · Qianli Shen · Kenji Kawaguchi, ,https://arxiv.org/abs/2312.00057,,2312.00057.pdf,VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models,"The booming use of text-to-image generative models has raised concerns about +their high risk of producing copyright-infringing content. While probabilistic +copyright protection methods provide a probabilistic guarantee against such +infringement, in this paper, we introduce Virtually Assured Amplification +Attack (VA3), a novel online attack framework that exposes the vulnerabilities +of these protection mechanisms. The proposed framework significantly amplifies +the probability of generating infringing content on the sustained interactions +with generative models and a non-trivial lower-bound on the success probability +of each engagement. Our theoretical and experimental results demonstrate the +effectiveness of our approach under various scenarios. These findings highlight +the potential risk of implementing probabilistic copyright protection in +practical applications of text-to-image generative models. Code is available at +https://github.com/South7X/VA3.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV', 'cs.MM']" +Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion,Fan Zhang · Shaodi You · Yu Li · Ying Fu,https://github.com/zkawfanx/Atlantis,https://arxiv.org/abs/2312.12471,,2312.12471.pdf,Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion,"Monocular depth estimation has experienced significant progress on +terrestrial images in recent years, largely due to deep learning advancements. 
+However, it remains inadequate for underwater scenes, primarily because of data +scarcity. Given the inherent challenges of light attenuation and backscattering +in water, acquiring clear underwater images or precise depth information is +notably difficult and costly. Consequently, learning-based approaches often +rely on synthetic data or turn to unsupervised or self-supervised methods to +mitigate this lack of data. Nonetheless, the performance of these methods is +often constrained by the domain gap and looser constraints. In this paper, we +propose a novel pipeline for generating photorealistic underwater images using +accurate terrestrial depth data. This approach facilitates the training of +supervised models for underwater depth estimation, effectively reducing the +performance disparity between terrestrial and underwater environments. Contrary +to prior synthetic datasets that merely apply style transfer to terrestrial +images without altering the scene content, our approach uniquely creates +vibrant, non-existent underwater scenes by leveraging terrestrial depth data +through the innovative Stable Diffusion model. Specifically, we introduce a +unique Depth2Underwater ControlNet, trained on specially prepared \{Underwater, +Depth, Text\} data triplets, for this generation task. Our newly developed +dataset enables terrestrial depth estimation models to achieve considerable +improvements, both quantitatively and qualitatively, on unseen underwater +images, surpassing their terrestrial pre-trained counterparts. Moreover, the +enhanced depth accuracy for underwater scenes also aids underwater image +restoration techniques that rely on depth maps, further demonstrating our +dataset's utility. The dataset will be available at +https://github.com/zkawfanx/Atlantis.",cs.CV,['cs.CV'] +ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations,Maitreya Patel · Changhoon Kim · Sheng Cheng · Chitta Baral · 'YZ' Yezhou Yang, ,https://arxiv.org/abs/2312.04655,,2312.04655.pdf,ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations,"Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., +DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional +T2I benchmarks, at the cost of significant computational resources. The unCLIP +stack comprises T2I prior and diffusion image decoder. The T2I prior model +alone adds a billion parameters compared to the Latent Diffusion Models, which +increases the computational and high-quality data requirements. We introduce +ECLIPSE, a novel contrastive learning method that is both parameter and +data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g., +CLIP) to distill the knowledge into the prior model. We demonstrate that the +ECLIPSE trained prior, with only 3.3% of the parameters and trained on a mere +2.8% of the data, surpasses the baseline T2I priors with an average of 71.6% +preference score under resource-limited setting. It also attains performance on +par with SOTA big models, achieving an average of 63.36% preference score in +terms of the ability to follow the text compositions. 
Extensive experiments on
+two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE
+priors consistently deliver high performance while significantly reducing
+resource dependency.",cs.CV,['cs.CV']
+Hallucination Augmented Contrastive Learning for Multimodal Large Language Model,Chaoya Jiang · Haiyang Xu · Mengfan Dong · Jiaxing Chen · Wei Ye · Ming Yan · Qinghao Ye · Ji Zhang · Fei Huang · Shikun Zhang, ,https://arxiv.org/abs/2312.06968,,2312.06968.pdf,Hallucination Augmented Contrastive Learning for Multimodal Large Language Model,"Multi-modal large language models (MLLMs) have been shown to efficiently
+integrate natural language with visual information to handle multi-modal tasks.
+However, MLLMs still face a fundamental limitation of hallucinations, where
+they tend to generate erroneous or fabricated information. In this paper, we
+address hallucinations in MLLMs from a novel perspective of representation
+learning. We first analyzed the representation distribution of textual and
+visual tokens in MLLM, revealing two important findings: 1) there is a
+significant gap between textual and visual representations, indicating
+unsatisfactory cross-modal representation alignment; 2) representations of
+texts that contain and do not contain hallucinations are entangled, making it
+challenging to distinguish them. These two observations inspire us with a
+simple yet effective method to mitigate hallucinations. Specifically, we
+introduce contrastive learning into MLLMs and use text with hallucination as
+hard negative examples, naturally bringing representations of non-hallucinative
+text and visual samples closer while pushing away representations of
+non-hallucinating and hallucinative text. We evaluate our method quantitatively
+and qualitatively, showing its effectiveness in reducing hallucination
+occurrences and improving performance across multiple benchmarks. On the
+MMhal-Bench benchmark, our method obtains a 34.66% /29.5% improvement over the
+baseline MiniGPT-4/LLaVA. Our code is available on
+https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.",cs.CV,['cs.CV']
+Active Domain Adaptation with False Negative Prediction for Object Detection,Yuzuru Nakamura · Yasunori Ishii · Takayoshi Yamashita, ,https://arxiv.org/abs/2307.07944,,2307.07944.pdf,"Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling","Unsupervised domain adaptation (DA) with the aid of pseudo labeling
+techniques has emerged as a crucial approach for domain-adaptive 3D object
+detection. While effective, existing DA methods suffer from a substantial drop
+in performance when applied to a multi-class training setting, due to the
+co-existence of low-quality pseudo labels and class imbalance issues. In this
+paper, we address this challenge by proposing a novel ReDB framework tailored
+for learning to detect all classes at once. Our approach produces Reliable,
+Diverse, and class-Balanced pseudo 3D boxes to iteratively guide the
+self-training on a distributionally different target domain. To alleviate
+disruptions caused by the environmental discrepancy (e.g., beam numbers), the
+proposed cross-domain examination (CDE) assesses the correctness of pseudo
+labels by copy-pasting target instances into a source environment and measuring
+the prediction consistency.
To reduce computational overhead and mitigate the +object shift (e.g., scales and point densities), we design an overlapped boxes +counting (OBC) metric that allows to uniformly downsample pseudo-labeled +objects across different geometric characteristics. To confront the issue of +inter-class imbalance, we progressively augment the target point clouds with a +class-balanced set of pseudo-labeled target instances and source objects, which +boosts recognition accuracies on both frequently appearing and rare classes. +Experimental results on three benchmark datasets using both voxel-based (i.e., +SECOND) and point-based 3D detectors (i.e., PointRCNN) demonstrate that our +proposed ReDB approach outperforms existing 3D domain adaptation methods by a +large margin, improving 23.15% mAP on the nuScenes $\rightarrow$ KITTI task. +The code is available at https://github.com/zhuoxiao-chen/ReDB-DA-3Ddet.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Material Palette: Extraction of Materials from a Single Image,Ivan Lopes · Fabio Pizzati · Raoul de Charette,https://astra-vision.github.io/MaterialPalette/,https://arxiv.org/abs/2311.17060v1,,2311.17060v1.pdf,Material Palette: Extraction of Materials from a Single Image,"In this paper, we propose a method to extract physically-based rendering +(PBR) materials from a single real-world image. We do so in two steps: first, +we map regions of the image to material concepts using a diffusion model, which +allows the sampling of texture images resembling each material in the scene. +Second, we benefit from a separate network to decompose the generated textures +into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be +used in rendering applications. Our approach builds on existing synthetic +material libraries with SVBRDF ground truth, but also exploits a +diffusion-generated RGB texture dataset to allow generalization to new samples +using unsupervised domain adaptation (UDA). Our contributions are thoroughly +evaluated on synthetic and real-world datasets. We further demonstrate the +applicability of our method for editing 3D scenes with materials estimated from +real photographs. The code and models will be made open-source. Project page: +https://astra-vision.github.io/MaterialPalette/",cs.CV,"['cs.CV', 'cs.GR']" +DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly,Gianluca Scarpellini · Stefano Fiorini · Francesco Giuliari · Pietro Morerio · Alessio Del Bue,https://iit-pavis.github.io/DiffAssemble/,https://arxiv.org/abs/2402.19302,,2402.19302.pdf,DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly,"Reassembly tasks play a fundamental role in many fields and multiple +approaches exist to solve specific reassembly problems. In this context, we +posit that a general unified model can effectively address them all, +irrespective of the input data type (images, 3D, etc.). We introduce +DiffAssemble, a Graph Neural Network (GNN)-based architecture that learns to +solve reassembly tasks using a diffusion model formulation. Our method treats +the elements of a set, whether pieces of 2D patch or 3D object fragments, as +nodes of a spatial graph. Training is performed by introducing noise into the +position and rotation of the elements and iteratively denoising them to +reconstruct the coherent initial pose. DiffAssemble achieves state-of-the-art +(SOTA) results in most 2D and 3D reassembly tasks and is the first +learning-based approach that solves 2D puzzles for both rotation and +translation. 
Furthermore, we highlight its remarkable reduction in run-time, +performing 11 times faster than the quickest optimization-based method for +puzzle solving. Code available at https://github.com/IIT-PAVIS/DiffAssemble",cs.CV,['cs.CV'] +Situational Awareness Matters in 3D Vision Language Reasoning,Yunze Man · Liang-Yan Gui · Yu-Xiong Wang, ,https://arxiv.org/abs/2401.09340,,2401.09340.pdf,SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding,"3D vision-language grounding, which focuses on aligning language with the 3D +physical environment, stands as a cornerstone in the development of embodied +agents. In comparison to recent advancements in the 2D domain, grounding +language in 3D scenes faces several significant challenges: (i) the inherent +complexity of 3D scenes due to the diverse object configurations, their rich +attributes, and intricate relationships; (ii) the scarcity of paired 3D +vision-language data to support grounded learning; and (iii) the absence of a +unified learning framework to distill knowledge from grounded 3D data. In this +work, we aim to address these three major challenges in 3D vision-language by +examining the potential of systematically upscaling 3D vision-language learning +in indoor environments. We introduce the first million-scale 3D vision-language +dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising +2.5M vision-language pairs derived from both human annotations and our scalable +scene-graph-based generation approach. We demonstrate that this scaling allows +for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), +for 3D vision-language learning. Through extensive experiments, we showcase the +effectiveness of GPS by achieving state-of-the-art performance on all existing +3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is +unveiled through zero-shot transfer experiments in the challenging 3D +vision-language tasks. Project website: https://scene-verse.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.RO']" +PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness,Anh-Quan Cao · Angela Dai · Raoul de Charette,https://astra-vision.github.io/PaSCo/,https://arxiv.org/abs/2312.02158,,2312.02158.pdf,PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness,"We propose the task of Panoptic Scene Completion (PSC) which extends the +recently popular Semantic Scene Completion (SSC) task with instance-level +information to produce a richer understanding of the 3D scene. Our PSC proposal +utilizes a hybrid mask-based technique on the non-empty voxels from sparse +multi-scale completions. Whereas the SSC literature overlooks uncertainty which +is critical for robotics applications, we instead propose an efficient +ensembling to estimate both voxel-wise and instance-wise uncertainties along +PSC. This is achieved by building on a multi-input multi-output (MIMO) +strategy, while improving performance and yielding better uncertainty for +little additional compute. Additionally, we introduce a technique to aggregate +permutation-invariant mask predictions. Our experiments demonstrate that our +method surpasses all baselines in both Panoptic Scene Completion and +uncertainty estimation on three large-scale autonomous driving datasets. 
Our +code and data are available at https://astra-vision.github.io/PaSCo .",cs.CV,"['cs.CV', 'cs.AI']" +DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models,Yukang Cao · Yan-Pei Cao · Kai Han · Ying Shan · Kwan-Yee K. Wong,https://yukangcao.github.io/DreamAvatar/,https://arxiv.org/html/2402.17292v1,,2402.17292v1.pdf,DivAvatar: Diverse 3D Avatar Generation with a Single Prompt,"Text-to-Avatar generation has recently made significant strides due to +advancements in diffusion models. However, most existing work remains +constrained by limited diversity, producing avatars with subtle differences in +appearance for a given text prompt. We design DivAvatar, a novel framework that +generates diverse avatars, empowering 3D creatives with a multitude of distinct +and richly varied 3D avatars from a single text prompt. Different from most +existing work that exploits scene-specific 3D representations such as NeRF, +DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse +avatar generation from simply noise sampling in inference time. DivAvatar has +two key designs that help achieve generation diversity and visual quality. The +first is a noise sampling technique during training phase which is critical in +generating diverse appearances. The second is a semantic-aware zoom mechanism +and a novel depth loss, the former producing appearances of high textual +fidelity by separate fine-tuning of specific body parts and the latter +improving geometry quality greatly by smoothing the generated mesh in the +features space. Extensive experiments show that DivAvatar is highly versatile +in generating avatars of diverse appearances.",cs.CV,['cs.CV'] +Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities,AJ Piergiovanni · Isaac Noble · Dahun Kim · Michael Ryoo · Victor Gomes · Anelia Angelova, ,https://arxiv.org/abs/2311.05698,,2311.05698.pdf,Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities,"One of the main challenges of multimodal learning is the need to combine +heterogeneous modalities (e.g., video, audio, text). For example, video and +audio are obtained at much higher rates than text and are roughly aligned in +time. They are often not synchronized with text, which comes as a global +context, e.g., a title, or a description. Furthermore, video and audio inputs +are of much larger volumes, and grow as the video length increases, which +naturally requires more compute dedicated to these modalities and makes +modeling of long-range dependencies harder. + We here decouple the multimodal modeling, dividing it into separate, focused +autoregressive models, processing the inputs according to the characteristics +of the modalities. We propose a multimodal model, called Mirasol3B, consisting +of an autoregressive component for the time-synchronized modalities (audio and +video), and an autoregressive component for the context modalities which are +not necessarily aligned in time but are still sequential. To address the +long-sequences of the video-audio inputs, we propose to further partition the +video and audio sequences in consecutive snippets and autoregressively process +their representations. To that end, we propose a Combiner mechanism, which +models the audio-video information jointly within a timeframe. 
The Combiner +learns to extract audio and video features from raw spatio-temporal signals, +and then learns to fuse these features producing compact but expressive +representations per snippet. + Our approach achieves the state-of-the-art on well established multimodal +benchmarks, outperforming much larger models. It effectively addresses the high +computational demand of media inputs by both learning compact representations, +controlling the sequence length of the audio-video feature representations, and +modeling their dependencies in time.",cs.CV,['cs.CV'] +Discontinuity-preserving Normal Integration with Auxiliary Edges,Hyomin Kim · Yucheol Jung · Seungyong Lee, ,https://arxiv.org/abs/2404.03138,,2404.03138.pdf,Discontinuity-preserving Normal Integration with Auxiliary Edges,"Many surface reconstruction methods incorporate normal integration, which is +a process to obtain a depth map from surface gradients. In this process, the +input may represent a surface with discontinuities, e.g., due to +self-occlusion. To reconstruct an accurate depth map from the input normal map, +hidden surface gradients occurring from the jumps must be handled. To model +these jumps correctly, we design a novel discretization scheme for the domain +of normal integration. Our key idea is to introduce auxiliary edges, which +bridge between piecewise-smooth patches in the domain so that the magnitude of +hidden jumps can be explicitly expressed. Using the auxiliary edges, we design +a novel algorithm to optimize the discontinuity and the depth map from the +input normal map. Our method optimizes discontinuities by using a combination +of iterative re-weighted least squares and iterative filtering of the jump +magnitudes on auxiliary edges to provide strong sparsity regularization. +Compared to previous discontinuity-preserving normal integration methods, which +model the magnitudes of jumps only implicitly, our method reconstructs subtle +discontinuities accurately thanks to our explicit representation of jumps +allowing for strong sparsity regularization.",cs.CV,"['cs.CV', 'cs.GR', 'I.4.5']" +Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing,Yafei Zhang · Shen Zhou · Huafeng Li, ,https://arxiv.org/abs/2403.01105,,2403.01105.pdf,Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing,"Recovering a clear image from a single hazy image is an open inverse problem. +Although significant research progress has been made, most existing methods +ignore the effect that downstream tasks play in promoting upstream dehazing. +From the perspective of the haze generation mechanism, there is a potential +relationship between the depth information of the scene and the hazy image. +Based on this, we propose a dual-task collaborative mutual promotion framework +to achieve the dehazing of a single image. This framework integrates depth +estimation and dehazing by a dual-task interaction mechanism and achieves +mutual enhancement of their performance. To realize the joint optimization of +the two tasks, an alternative implementation mechanism with the difference +perception is developed. On the one hand, the difference perception between the +depth maps of the dehazing result and the ideal image is proposed to promote +the dehazing network to pay attention to the non-ideal areas of the dehazing. 
On the other hand, by improving the depth estimation performance in the
+difficult-to-recover areas of the hazy image, the dehazing network can
+explicitly use the depth information of the hazy image to assist the clear
+image recovery. To promote the depth estimation, we propose to use the
+difference between the dehazed image and the ground truth to guide the depth
+estimation network to focus on the dehazed unideal areas. It allows dehazing
+and depth estimation to leverage their strengths in a mutually reinforcing
+manner. Experimental results show that the proposed method can achieve better
+performance than that of the state-of-the-art approaches.",cs.CV,['cs.CV']
+ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction,Zhicheng Zhang · Junyao Hu · Wentao Cheng · Danda Paudel · Jufeng Yang,https://zzcheng.top/ExtDM/,,https://junyaohu.github.io/publication/,,,,,nan
+VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis,Linshan Wu · Jia-Xin Zhuang · Hao Chen, ,https://arxiv.org/abs/2402.17300v1,,2402.17300v1.pdf,VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis,"Self-Supervised Learning (SSL) has demonstrated promising results in 3D
+medical image analysis. However, the lack of high-level semantics in
+pre-training still heavily hinders the performance of downstream tasks. We
+observe that 3D medical images contain relatively consistent contextual
+position information, i.e., consistent geometric relations between different
+organs, which leads to a potential way for us to learn consistent semantic
+representations in pre-training. In this paper, we propose a
+simple-yet-effective Volume Contrast (VoCo) framework to leverage the
+contextual position priors for pre-training. Specifically, we first generate a
+group of base crops from different regions while enforcing feature discrepancy
+among them, where we employ them as class assignments of different regions.
+Then, we randomly crop sub-volumes and predict them belonging to which class
+(located at which region) by contrasting their similarity to different base
+crops, which can be seen as predicting contextual positions of different
+sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual
+position priors into model representations without the guidance of annotations,
+enabling us to effectively improve the performance of downstream tasks that
+require high-level semantics. Extensive experimental results on six downstream
+tasks demonstrate the superior effectiveness of VoCo. Code will be available at
+https://github.com/Luffy03/VoCo.",eess.IV,['eess.IV']
+JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models,YUNCHENG GUO · Xiaodong Gu, ,https://arxiv.org/abs/2312.01564,,2312.01564.pdf,APoLLo: Unified Adapter and Prompt Learning for Vision Language Models,"The choice of input text prompt plays a critical role in the performance of
+Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a
+unified multi-modal approach that combines Adapter and Prompt learning for
+Vision-Language models. Our method is designed to substantially improve the
+generalization capabilities of VLP models when they are fine-tuned in a
+few-shot setting. We introduce trainable cross-attention-based adapter layers
+in conjunction with vision and language encoders to strengthen the alignment
+between the two modalities.
We enforce consistency between the respective +encoder branches (receiving augmented inputs) to prevent overfitting in +downstream tasks. Our method is evaluated on three representative tasks: +generalization to novel classes, cross-dataset evaluation, and unseen domain +shifts. In practice, APoLLo achieves a relative gain up to 6.03% over MaPLe +(SOTA) on novel classes for 10 diverse image recognition datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.CV']" +F$^3$Loc: Fusion and Filtering for Floorplan Localization,Changan Chen · Rui Wang · Christoph Vogel · Marc Pollefeys, ,https://arxiv.org/abs/2403.03370,,2403.03370.pdf,F$^3$Loc: Fusion and Filtering for Floorplan Localization,"In this paper we propose an efficient data-driven solution to +self-localization within a floorplan. Floorplan data is readily available, +long-term persistent and inherently robust to changes in the visual appearance. +Our method does not require retraining per map and location or demand a large +database of images of the area of interest. We propose a novel probabilistic +model consisting of an observation and a novel temporal filtering module. +Operating internally with an efficient ray-based representation, the +observation module consists of a single and a multiview module to predict +horizontal depth from images and fuses their results to benefit from advantages +offered by either methodology. Our method operates on conventional consumer +hardware and overcomes a common limitation of competing methods that often +demand upright images. Our full system meets real-time requirements, while +outperforming the state-of-the-art by a significant margin.",cs.CV,"['cs.CV', 'cs.RO']" +Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance,Phuc Nguyen · Tuan Duc Ngo · Evangelos Kalogerakis · Chuang Gan · Anh Tran · Cuong Pham · Khoi Nguyen,https://open3dis.github.io/,https://arxiv.org/abs/2312.10671,,2312.10671.pdf,Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance,"We introduce Open3DIS, a novel solution designed to tackle the problem of +Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D +environments exhibit diverse shapes, scales, and colors, making precise +instance-level identification a challenging task. Recent advancements in +Open-Vocabulary scene understanding have made significant strides in this area +by employing class-agnostic 3D instance proposal networks for object +localization and learning queryable features for each 3D mask. While these +methods produce high-quality instance proposals, they struggle with identifying +small-scale and geometrically ambiguous objects. The key idea of our method is +a new module that aggregates 2D instance masks across frames and maps them to +geometrically coherent point cloud regions as high-quality object proposals +addressing the above limitations. These are then combined with 3D +class-agnostic instance proposals to include a wide range of objects in the +real world. 
To validate our approach, we conducted experiments on three +prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating +significant performance gains in segmenting objects with diverse categories +over the state-of-the-art approaches.",cs.CV,['cs.CV'] +Binarized Low-light Raw Video Enhancement,Gengchen Zhang · Yulun Zhang · Xin Yuan · Ying Fu, ,https://arxiv.org/abs/2403.19944,,2403.19944.pdf,Binarized Low-light Raw Video Enhancement,"Recently, deep neural networks have achieved excellent performance on +low-light raw video enhancement. However, they often come with high +computational complexity and large memory costs, which hinder their +applications on resource-limited devices. In this paper, we explore the +feasibility of applying the extremely compact binary neural network (BNN) to +low-light raw video enhancement. Nevertheless, there are two main issues with +binarizing video enhancement models. One is how to fuse the temporal +information to improve low-light denoising without complex modules. The other +is how to narrow the performance gap between binary convolutions with the full +precision ones. To address the first issue, we introduce a spatial-temporal +shift operation, which is easy-to-binarize and effective. The temporal shift +efficiently aggregates the features of neighbor frames and the spatial shift +handles the misalignment caused by the large motion in videos. For the second +issue, we present a distribution-aware binary convolution, which captures the +distribution characteristics of real-valued input and incorporates them into +plain binary convolutions to alleviate the degradation in performance. +Extensive quantitative and qualitative experiments have shown our +high-efficiency binarized low-light raw video enhancement method can attain a +promising performance.",cs.CV,"['cs.CV', 'eess.IV']" +Generating Non-Stationary Textures using Self-Rectification,Yang Zhou · Rongjun Xiao · Dani Lischinski · Daniel Cohen-Or · Hui Huang,https://vcc.tech/research/2024/TexRec,https://arxiv.org/abs/2401.02847,,2401.02847.pdf,Generating Non-Stationary Textures using Self-Rectification,"This paper addresses the challenge of example-based non-stationary texture +synthesis. We introduce a novel twostep approach wherein users first modify a +reference texture using standard image editing tools, yielding an initial rough +target for the synthesis. Subsequently, our proposed method, termed +""self-rectification"", automatically refines this target into a coherent, +seamless texture, while faithfully preserving the distinct visual +characteristics of the reference exemplar. Our method leverages a pre-trained +diffusion network, and uses self-attention mechanisms, to gradually align the +synthesized texture with the reference, ensuring the retention of the +structures in the provided target. Through experimental validation, our +approach exhibits exceptional proficiency in handling non-stationary textures, +demonstrating significant advancements in texture synthesis when compared to +existing state-of-the-art techniques. 
Code is available at +https://github.com/xiaorongjun000/Self-Rectification",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering,Tao Hu · Fangzhou Hong · Ziwei Liu, ,https://arxiv.org/abs/2404.01225,,2404.01225.pdf,SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering,"Dynamic human rendering from video sequences has achieved remarkable progress +by formulating the rendering as a mapping from static poses to human images. +However, existing methods focus on the human appearance reconstruction of every +single frame while the temporal motion relations are not fully explored. In +this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly +models the temporal dynamics and human appearances in a unified framework with +three key designs: 1) Surface-based motion encoding that models 4D human +motions with an efficient compact surface-based triplane. It encodes both +spatial and temporal motion relations on the dense surface manifold of a +statistical body template, which inherits body topology priors for +generalizable novel view synthesis with sparse training observations. 2) +Physical motion decoding that is designed to encourage physical motion learning +by decoding the motion triplane features at timestep t to predict both spatial +derivatives and temporal derivatives at the next timestep t+1 in the training +stage. 3) 4D appearance decoding that renders the motion triplanes into images +by an efficient volumetric surface-conditioned renderer that focuses on the +rendering of body surfaces with motion learning conditioning. Extensive +experiments validate the state-of-the-art performance of our new paradigm and +illustrate the expressiveness of surface-based motion triplanes for rendering +high-fidelity view-consistent humans with fast motions and even +motion-dependent shadows. Our project page is at: +https://taohuumd.github.io/projects/SurMo/",cs.CV,['cs.CV'] +MultiDiff: Consistent Novel View Synthesis from a Single Image,Norman Müller · Katja Schwarz · Katja Schwarz · Barbara Roessle · Lorenzo Porzi · Samuel Rota Bulò · Matthias Nießner · Peter Kontschieder, ,,https://sirwyver.github.io/publications/,,,,,nan +Vector Graphics Generation via Mutually Impulsed Dual-domain Diffusion,Zhongyin Zhao · Ye Chen · Zhangli Hu · Xuanhong Chen · Bingbing Ni, ,https://arxiv.org/abs/2312.10540,,,VecFusion: Vector Font Generation with Diffusion,"We present VecFusion, a new neural architecture that can generate vector +fonts with varying topological structures and precise control point positions. +Our approach is a cascaded diffusion model which consists of a raster diffusion +model followed by a vector diffusion model. The raster model generates +low-resolution, rasterized fonts with auxiliary control point information, +capturing the global style and shape of the font, while the vector model +synthesizes vector fonts conditioned on the low-resolution raster fonts from +the first stage. To synthesize long and complex curves, our vector diffusion +model uses a transformer architecture and a novel vector representation that +enables the modeling of diverse vector geometry and the precise prediction of +control points. 
Our experiments show that, in contrast to previous generative +models for vector graphics, our new cascaded vector diffusion model generates +higher quality vector fonts, with complex structures and diverse styles.",cs.CV,"['cs.CV', 'cs.GR']" +Equivariant plug-and-play image reconstruction,Matthieu Terris · Thomas Moreau · Nelly Pustelnik · Julián Tachella, ,https://arxiv.org/html/2312.01831v2,,2312.01831v2.pdf,Equivariant plug-and-play image reconstruction,"Plug-and-play algorithms constitute a popular framework for solving inverse +imaging problems that rely on the implicit definition of an image prior via a +denoiser. These algorithms can leverage powerful pre-trained denoisers to solve +a wide range of imaging tasks, circumventing the necessity to train models on a +per-task basis. Unfortunately, plug-and-play methods often show unstable +behaviors, hampering their promise of versatility and leading to suboptimal +quality of reconstructed images. In this work, we show that enforcing +equivariance to certain groups of transformations (rotations, reflections, +and/or translations) on the denoiser strongly improves the stability of the +algorithm as well as its reconstruction quality. We provide a theoretical +analysis that illustrates the role of equivariance on better performance and +stability. We present a simple algorithm that enforces equivariance on any +existing denoiser by simply applying a random transformation to the input of +the denoiser and the inverse transformation to the output at each iteration of +the algorithm. Experiments on multiple imaging modalities and denoising +networks show that the equivariant plug-and-play algorithm improves both the +reconstruction performance and the stability compared to their non-equivariant +counterparts.",eess.IV,"['eess.IV', 'cs.CV']" +"SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM",Nikhil Keetha · Jay Karhade · Krishna Murthy Jatavallabhula · Gengshan Yang · Sebastian Scherer · Deva Ramanan · Jonathon Luiten,https://spla-tam.github.io/,https://arxiv.org/abs/2312.02126,,2312.02126.pdf,"SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM","Dense simultaneous localization and mapping (SLAM) is crucial for robotics +and augmented reality applications. However, current methods are often hampered +by the non-volumetric or implicit way they represent a scene. This work +introduces SplaTAM, an approach that, for the first time, leverages explicit +volumetric representations, i.e., 3D Gaussians, to enable high-fidelity +reconstruction from a single unposed RGB-D camera, surpassing the capabilities +of existing methods. SplaTAM employs a simple online tracking and mapping +system tailored to the underlying Gaussian representation. It utilizes a +silhouette mask to elegantly capture the presence of scene density. This +combination enables several benefits over prior representations, including fast +rendering and dense optimization, quickly determining if areas have been +previously mapped, and structured map expansion by adding more Gaussians. 
+Extensive experiments show that SplaTAM achieves up to 2x superior performance +in camera pose estimation, map construction, and novel-view synthesis over +existing methods, paving the way for more immersive high-fidelity SLAM +applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring,Chengxu Liu · Xuan Wang · Xiangyu Xu · Ruhao Tian · Shuai Li · Xueming Qian · Ming-Hsuan Yang, ,https://arxiv.org/abs/2404.13153,,2404.13153.pdf,Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring,"Eliminating image blur produced by various kinds of motion has been a +challenging problem. Dominant approaches rely heavily on model capacity to +remove blurring by reconstructing residual from blurry observation in feature +space. These practices not only prevent the capture of spatially variable +motion in the real world but also ignore the tailored handling of various +motions in image space. In this paper, we propose a novel real-world deblurring +filtering model called the Motion-adaptive Separable Collaborative (MISC) +Filter. In particular, we use a motion estimation network to capture motion +information from neighborhoods, thereby adaptively estimating spatially-variant +motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The +MISC Filter first aligns the motion-induced blurring patterns to the motion +middle along the predicted flow direction, and then collaboratively filters the +aligned image through the predicted kernels, weights, and offsets to generate +the output. This design can handle more generalized and complex motion in a +spatially differentiated manner. Furthermore, we analyze the relationships +between the motion estimation network and the residual reconstruction network. +Extensive experiments on four widely used benchmarks demonstrate that our +method provides an effective solution for real-world motion blur removal and +achieves state-of-the-art performance. Code is available at +https://github.com/ChengxuLiu/MISCFilter",eess.IV,"['eess.IV', 'cs.CV']" +BoQ: A Place is Worth a Bag of Learnable Queries,Amar Ali-bey · Brahim Chaib-draa · Philippe Giguère, ,https://arxiv.org/abs/2405.07364,,2405.07364.pdf,BoQ: A Place is Worth a Bag of Learnable Queries,"In visual place recognition, accurately identifying and matching images of +locations under varying environmental conditions and viewpoints remains a +significant challenge. In this paper, we introduce a new technique, called +Bag-of-Queries (BoQ), which learns a set of global queries designed to capture +universal place-specific attributes. Unlike existing methods that employ +self-attention and generate the queries directly from the input features, BoQ +employs distinct learnable global queries, which probe the input features via +cross-attention, ensuring consistent information aggregation. In addition, our +technique provides an interpretable attention mechanism and integrates with +both CNN and Vision Transformer backbones. The performance of BoQ is +demonstrated through extensive experiments on 14 large-scale benchmarks. It +consistently outperforms current state-of-the-art techniques including NetVLAD, +MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), +BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and +R2Former, all while being orders of magnitude faster and more efficient. 
The +code and model weights are publicly available at +https://github.com/amaralibey/Bag-of-Queries.",cs.CV,['cs.CV'] +Deformable One-shot Face Stylization via DINO Semantic Guidance,Yang Zhou · Zichong Chen · Hui Huang,https://vcc.tech/research/2024/DoesFS,https://arxiv.org/abs/2403.00459,,2403.00459.pdf,Deformable One-shot Face Stylization via DINO Semantic Guidance,"This paper addresses the complex issue of one-shot face stylization, focusing +on the simultaneous consideration of appearance and structure, where previous +methods have fallen short. We explore deformation-aware face stylization that +diverges from traditional single-image style reference, opting for a real-style +image pair instead. The cornerstone of our method is the utilization of a +self-supervised vision transformer, specifically DINO-ViT, to establish a +robust and consistent facial structure representation across both real and +style domains. Our stylization process begins by adapting the StyleGAN +generator to be deformation-aware through the integration of spatial +transformers (STN). We then introduce two innovative constraints for generator +fine-tuning under the guidance of DINO semantics: i) a directional deformation +loss that regulates directional vectors in DINO space, and ii) a relative +structural consistency constraint based on DINO token self-similarities, +ensuring diverse generation. Additionally, style-mixing is employed to align +the color generation with the reference, minimizing inconsistent +correspondences. This framework delivers enhanced deformability for general +one-shot face stylization, achieving notable efficiency with a fine-tuning +duration of approximately 10 minutes. Extensive qualitative and quantitative +comparisons demonstrate our superiority over state-of-the-art one-shot face +stylization methods. Code is available at https://github.com/zichongc/DoesFS",cs.CV,['cs.CV'] +Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion,Yuanxun Lu · Jingyang Zhang · Shiwei Li · Tian Fang · David McKinnon · Yanghai Tsin · Long Quan · Xun Cao · Yao Yao,https://nju-3dv.github.io/projects/direct25/,https://arxiv.org/abs/2311.15980,,2311.15980.pdf,Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion,"Recent advances in generative AI have unveiled significant potential for the +creation of 3D content. However, current methods either apply a pre-trained 2D +diffusion model with the time-consuming score distillation sampling (SDS), or a +direct 3D diffusion model trained on limited 3D data losing generation +diversity. In this work, we approach the problem by employing a multi-view 2.5D +diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D +diffusion directly models the structural distribution of 3D data, while still +maintaining the strong generalization ability of the original 2D diffusion +model, filling the gap between 2D diffusion-based and direct 3D diffusion-based +methods for 3D content generation. During inference, multi-view normal maps are +generated using the 2.5D diffusion, and a novel differentiable rasterization +scheme is introduced to fuse the almost consistent multi-view normal maps into +a consistent 3D model. We further design a normal-conditioned multi-view image +generation module for fast appearance generation given the 3D geometry. Our +method is a one-pass diffusion process and does not require any SDS +optimization as post-processing. 
We demonstrate through extensive experiments +that, our direct 2.5D generation with the specially-designed fusion scheme can +achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in +only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.",cs.CV,['cs.CV'] +Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation,Ba Hung Ngo · Nhat-Tuong Do-Tran · Tuan-Ngoc Nguyen · Hae-Gon Jeon · Tae Jong Choi,https://dotrannhattuong.github.io/ECB/website/,https://arxiv.org/abs/2403.18360,,2403.18360.pdf,Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation,"Most domain adaptation (DA) methods are based on either a convolutional +neural networks (CNNs) or a vision transformers (ViTs). They align the +distribution differences between domains as encoders without considering their +unique characteristics. For instance, ViT excels in accuracy due to its +superior ability to capture global representations, while CNN has an advantage +in capturing local representations. This fact has led us to design a hybrid +method to fully take advantage of both ViT and CNN, called Explicitly +Class-specific Boundaries (ECB). ECB learns CNN on ViT to combine their +distinct strengths. In particular, we leverage ViT's properties to explicitly +find class-specific decision boundaries by maximizing the discrepancy between +the outputs of the two classifiers to detect target samples far from the source +support. In contrast, the CNN encoder clusters target features based on the +previously defined class-specific boundaries by minimizing the discrepancy +between the probabilities of the two classifiers. Finally, ViT and CNN mutually +exchange knowledge to improve the quality of pseudo labels and reduce the +knowledge discrepancies of these models. Compared to conventional DA methods, +our ECB achieves superior performance, which verifies its effectiveness in this +hybrid model. The project website can be found +https://dotrannhattuong.github.io/ECB/website.",cs.CV,['cs.CV'] +Versatile Navigation under Partial Observability via Value-Guided Diffusion Policy,Gengyu Zhang · Hao Tang · Yan Yan, ,https://arxiv.org/abs/2404.02176,,2404.02176.pdf,Versatile Navigation under Partial Observability via Value-guided Diffusion Policy,"Route planning for navigation under partial observability plays a crucial +role in modern robotics and autonomous driving. Existing route planning +approaches can be categorized into two main classes: traditional autoregressive +and diffusion-based methods. The former often fails due to its myopic nature, +while the latter either assumes full observability or struggles to adapt to +unfamiliar scenarios, due to strong couplings with behavior cloning from +experts. To address these deficiencies, we propose a versatile diffusion-based +approach for both 2D and 3D route planning under partial observability. +Specifically, our value-guided diffusion policy first generates plans to +predict actions across various timesteps, providing ample foresight to the +planning. It then employs a differentiable planner with state estimations to +derive a value function, directing the agent's exploration and goal-seeking +behaviors without seeking experts while explicitly addressing partial +observability. During inference, our policy is further enhanced by a +best-plan-selection strategy, substantially boosting the planning success rate. 
+Moreover, we propose projecting point clouds, derived from RGB-D inputs, onto +2D grid-based bird-eye-view maps via semantic segmentation, generalizing to 3D +environments. This simple yet effective adaption enables zero-shot transfer +from 2D-trained policy to 3D, cutting across the laborious training for 3D +policy, and thus certifying our versatility. Experimental results demonstrate +our superior performance, particularly in navigating situations beyond expert +demonstrations, surpassing state-of-the-art autoregressive and diffusion-based +baselines for both 2D and 3D scenarios.",cs.RO,"['cs.RO', 'cs.AI']" +PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees,Chulin Xie · De-An Huang · Wenda Chu · Daguang Xu · Chaowei Xiao · Bo Li · Anima Anandkumar, ,https://arxiv.org/abs/2405.09771,,2405.09771.pdf,Harmonizing Generalization and Personalization in Federated Prompt Learning,"Federated Prompt Learning (FPL) incorporates large pre-trained +Vision-Language models (VLM) into federated learning through prompt tuning. The +transferable representations and remarkable generalization capacity of VLM make +them highly compatible with the integration of federated learning. Addressing +data heterogeneity in federated learning requires personalization, but +excessive focus on it across clients could compromise the model's ability to +generalize effectively. To preserve the impressive generalization capability of +VLM, it is crucial to strike a balance between personalization and +generalization in FPL. To tackle this challenge, we proposed Federated Prompt +Learning with CLIP Generalization and low-rank Personalization (FedPGP), which +employs pre-trained CLIP to provide knowledge-guidance on the global prompt for +improved generalization and incorporates a low-rank adaptation term to +personalize the global prompt. Further, FedPGP integrates a prompt-wise +contrastive loss to achieve knowledge guidance and personalized adaptation +simultaneously, enabling a harmonious balance between personalization and +generalization in FPL. We conduct extensive experiments on various datasets to +explore base-to-novel generalization in both category-level and domain-level +scenarios with heterogeneous data, showing the superiority of FedPGP in +balancing generalization and personalization.",cs.LG,['cs.LG'] +Bi-Causal: Group Activity Recognition via Bidirectional Causality,Youliang Zhang · Wenxuan Liu · danni xu · Zhuo Zhou · Zheng Wang, ,https://arxiv.org/html/2312.00404v1,,2312.00404v1.pdf,A Causality-Aware Pattern Mining Scheme for Group Activity Recognition in a Pervasive Sensor Space,"Human activity recognition (HAR) is a key challenge in pervasive computing +and its solutions have been presented based on various disciplines. +Specifically, for HAR in a smart space without privacy and accessibility +issues, data streams generated by deployed pervasive sensors are leveraged. In +this paper, we focus on a group activity by which a group of users perform a +collaborative task without user identification and propose an efficient group +activity recognition scheme which extracts causality patterns from pervasive +sensor event sequences generated by a group of users to support as good +recognition accuracy as the state-of-the-art graphical model. To filter out +irrelevant noise events from a given data stream, a set of rules is leveraged +to highlight causally related events. 
Then, a pattern-tree algorithm extracts +frequent causal patterns by means of a growing tree structure. Based on the +extracted patterns, a weighted sum-based pattern matching algorithm computes +the likelihoods of stored group activities to the given test event sequence by +means of matched event pattern counts for group activity recognition. We +evaluate the proposed scheme using the data collected from our testbed and +CASAS datasets where users perform their tasks on a daily basis and validate +its effectiveness in a real environment. Experiment results show that the +proposed scheme performs higher recognition accuracy and with a small amount of +runtime overhead than the existing schemes.",cs.LG,"['cs.LG', 'cs.DB']" +MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning,Yixin Liu · Chenrui Fan · Yutong Dai · Xun Chen · Pan Zhou · Lichao Sun, ,https://arxiv.org/abs/2311.13127v3,,2311.13127v3.pdf,MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning,"Text-to-image diffusion models allow seamless generation of personalized +images from scant reference photos. Yet, these tools, in the wrong hands, can +fabricate misleading or harmful content, endangering individuals. To address +this problem, existing poisoning-based approaches perturb user images in an +imperceptible way to render them ""unlearnable"" from malicious uses. We identify +two limitations of these defending approaches: i) sub-optimal due to the +hand-crafted heuristics for solving the intractable bilevel optimization and +ii) lack of robustness against simple data transformations like Gaussian +filtering. To solve these challenges, we propose MetaCloak, which solves the +bi-level poisoning problem with a meta-learning framework with an additional +transformation sampling process to craft transferable and robust perturbation. +Specifically, we employ a pool of surrogate diffusion models to craft +transferable and model-agnostic perturbation. Furthermore, by incorporating an +additional transformation process, we design a simple denoising-error +maximization loss that is sufficient for causing transformation-robust semantic +distortion and degradation in a personalized generation. Extensive experiments +on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing +approaches. Notably, MetaCloak can successfully fool online training services +like Replicate, in a black-box manner, demonstrating the effectiveness of +MetaCloak in real-world scenarios. Our code is available at +https://github.com/liuyixin-louis/MetaCloak.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR']" +BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models,Fengyuan Shi · Jiaxi Gu · Hang Xu · Songcen Xu · Wei Zhang · Limin Wang,https://github.com/MCG-NJU/BIVDiff,https://arxiv.org/abs/2312.02813,,2312.02813.pdf,BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models,"Diffusion models have made tremendous progress in text-driven image and video +generation. Now text-to-image foundation models are widely applied to various +downstream image synthesis tasks, such as controllable image generation and +image editing, while downstream video synthesis tasks are less explored for +several reasons. First, it requires huge memory and computation overhead to +train a video generation foundation model. 
Even with video foundation models, +additional costly training is still required for downstream video synthesis +tasks. Second, although some works extend image diffusion models into videos in +a training-free manner, temporal consistency cannot be well preserved. Finally, +these adaption methods are specifically designed for one task and fail to +generalize to different tasks. To mitigate these issues, we propose a +training-free general-purpose video synthesis framework, coined as {\bf +BIVDiff}, via bridging specific image diffusion models and general +text-to-video foundation diffusion models. Specifically, we first use a +specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for +frame-wise video generation, then perform Mixed Inversion on the generated +video, and finally input the inverted latents into the video diffusion models +(e.g., VidRD and ZeroScope) for temporal smoothing. This decoupled framework +enables flexible image model selection for different purposes with strong task +generalization and high efficiency. To validate the effectiveness and general +use of BIVDiff, we perform a wide range of video synthesis tasks, including +controllable video generation, video editing, video inpainting, and +outpainting.",cs.CV,"['cs.CV', 'cs.AI']" +A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames,Pinelopi Papalampidi · Skanda Koppula · Shreya Pathak · Justin Chiu · Joseph Heyward · Viorica Patraucean · Jiajun Shen · Antoine Miech · Andrew Zisserman · Aida Nematzadeh, ,https://arxiv.org/abs/2312.07395,,2312.07395.pdf,A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames,"Understanding long, real-world videos requires modeling of long-range visual +dependencies. To this end, we explore video-first architectures, building on +the common paradigm of transferring large-scale, image--text models to video +via shallow temporal fusion. However, we expose two limitations to the +approach: (1) decreased spatial capabilities, likely due to poor +video--language alignment in standard video datasets, and (2) higher memory +consumption, bottlenecking the number of frames that can be processed. To +mitigate the memory bottleneck, we systematically analyze the memory/accuracy +trade-off of various efficient methods: factorized attention, +parameter-efficient image-to-video adaptation, input masking, and +multi-resolution patchification. Surprisingly, simply masking large portions of +the video (up to 75%) during contrastive pre-training proves to be one of the +most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. 
Our +simple approach for training long video-to-text models, which scales to 1B +parameters, does not add new architectural complexity and is able to outperform +the popular paradigm of using much larger LLMs as an information aggregator +over segment-based information on benchmarks with long-range temporal +dependencies (YouCook2, EgoSchema).",cs.CV,"['cs.CV', 'cs.CL']" +DaReNeRF: Direction-aware Representation for Dynamic Scenes,Ange Lou · Benjamin Planche · Zhongpai Gao · Yamin Li · Tianyu Luan · Hao Ding · Terrence Chen · Jack Noble · Ziyan Wu, ,https://arxiv.org/abs/2403.02265v1,,2403.02265v1.pdf,DaReNeRF: Direction-aware Representation for Dynamic Scenes,"Addressing the intricate challenge of modeling and re-rendering dynamic +scenes, most recent approaches have sought to simplify these complexities using +plane-based explicit representations, overcoming the slow training time issues +associated with methods like Neural Radiance Fields (NeRF) and implicit +representations. However, the straightforward decomposition of 4D dynamic +scenes into multiple 2D plane-based representations proves insufficient for +re-rendering high-fidelity scenes with complex motions. In response, we present +a novel direction-aware representation (DaRe) approach that captures scene +dynamics from six different directions. This learned representation undergoes +an inverse dual-tree complex wavelet transformation (DTCWT) to recover +plane-based information. DaReNeRF computes features for each space-time point +by fusing vectors from these recovered planes. Combining DaReNeRF with a tiny +MLP for color regression and leveraging volume rendering in training yield +state-of-the-art performance in novel view synthesis for complex dynamic +scenes. Notably, to address redundancy introduced by the six real and six +imaginary direction-aware wavelet coefficients, we introduce a trainable +masking approach, mitigating storage issues without significant performance +decline. Moreover, DaReNeRF maintains a 2x reduction in training time compared +to prior art while delivering superior performance.",cs.CV,"['cs.CV', 'cs.GR']" +Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels,Zhuohong Li · Wei He · Jiepan Li · Fangxiao Lu · Hongyan Zhang, ,https://arxiv.org/abs/2403.02746,,2403.02746.pdf,Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels,"Large-scale high-resolution (HR) land-cover mapping is a vital task to survey +the Earth's surface and resolve many challenges facing humanity. However, it is +still a non-trivial task hindered by complex ground details, various landforms, +and the scarcity of accurate training labels over a wide-span geographic area. +In this paper, we propose an efficient, weakly supervised framework +(Paraformer) to guide large-scale HR land-cover mapping with easy-access +historical land-cover data of low resolution (LR). Specifically, existing +land-cover mapping approaches reveal the dominance of CNNs in preserving local +ground details but still suffer from insufficient global modeling in various +landforms. Therefore, we design a parallel CNN-Transformer feature extractor in +Paraformer, consisting of a downsampling-free CNN branch and a Transformer +branch, to jointly capture local and global contextual information. 
Besides, +facing the spatial mismatch of training data, a pseudo-label-assisted training +(PLAT) module is adopted to reasonably refine LR labels for weakly supervised +semantic segmentation of HR images. Experiments on two large-scale datasets +demonstrate the superiority of Paraformer over other state-of-the-art methods +for automatically updating HR land-cover maps from LR historical labels.",cs.CV,"['cs.CV', 'cs.LG']" +SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks,Yaxu Xie · Alain Pagani · Didier Stricker, ,https://arxiv.org/abs/2403.19474,,2403.19474.pdf,SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks,"Scene graphs have been recently introduced into 3D spatial understanding as a +comprehensive representation of the scene. The alignment between 3D scene +graphs is the first step of many downstream tasks such as scene graph aided +point cloud registration, mosaicking, overlap checking, and robot navigation. +In this work, we treat 3D scene graph alignment as a partial graph-matching +problem and propose to solve it with a graph neural network. We reuse the +geometric features learned by a point cloud registration method and associate +the clustered point-level geometric features with the node-level semantic +feature via our designed feature fusion module. Partial matching is enabled by +using a learnable method to select the top-k similar node pairs. Subsequent +downstream tasks such as point cloud registration are achieved by running a +pre-trained registration network within the matched regions. We further propose +a point-matching rescoring method, that uses the node-wise alignment of the 3D +scene graph to reweight the matching candidates from a pre-trained point cloud +registration method. It reduces the false point correspondences estimated +especially in low-overlapping cases. Experiments show that our method improves +the alignment accuracy by 10~20% in low-overlap and random transformation +scenarios and outperforms the existing work in multiple downstream tasks.",cs.CV,"['cs.CV', 'cs.RO']" +Frequency-Adaptive Dilated Convolution for Semantic Segmentation,Linwei Chen · Lin Gu · Dezhi Zheng · Ying Fu,https://github.com/Linwei-Chen/FADC,https://arxiv.org/abs/2403.05369,,2403.05369.pdf,Frequency-Adaptive Dilated Convolution for Semantic Segmentation,"Dilated convolution, which expands the receptive field by inserting gaps +between its consecutive elements, is widely employed in computer vision. In +this study, we propose three strategies to improve individual phases of dilated +convolution from the view of spectrum analysis. Departing from the conventional +practice of fixing a global dilation rate as a hyperparameter, we introduce +Frequency-Adaptive Dilated Convolution (FADC), which dynamically adjusts +dilation rates spatially based on local frequency components. Subsequently, we +design two plug-in modules to directly enhance effective bandwidth and +receptive field size. The Adaptive Kernel (AdaKern) module decomposes +convolution weights into low-frequency and high-frequency components, +dynamically adjusting the ratio between these components on a per-channel +basis. By increasing the high-frequency part of convolution weights, AdaKern +captures more high-frequency components, thereby improving effective bandwidth. 
+The Frequency Selection (FreqSelect) module optimally balances high- and +low-frequency components in feature representations through spatially variant +reweighting. It suppresses high frequencies in the background to encourage FADC +to learn a larger dilation, thereby increasing the receptive field for an +expanded scope. Extensive experiments on segmentation and object detection +consistently validate the efficacy of our approach. The code is publicly +available at https://github.com/Linwei-Chen/FADC.",cs.CV,['cs.CV'] +Distilled Datamodel with Reverse Gradient Matching,Jingwen Ye · Ruonan Yu · Songhua Liu · Xinchao Wang, ,https://arxiv.org/abs/2404.14006,,2404.14006.pdf,Distilled Datamodel with Reverse Gradient Matching,"The proliferation of large-scale AI models trained on extensive datasets has +revolutionized machine learning. With these models taking on increasingly +central roles in various applications, the need to understand their behavior +and enhance interpretability has become paramount. To investigate the impact of +changes in training data on a pre-trained model, a common approach is +leave-one-out retraining. This entails systematically altering the training +dataset by removing specific samples to observe resulting changes within the +model. However, retraining the model for each altered dataset presents a +significant computational challenge, given the need to perform this operation +for every dataset variation. In this paper, we introduce an efficient framework +for assessing data impact, comprising offline training and online evaluation +stages. During the offline training phase, we approximate the influence of +training data on the target model through a distilled synset, formulated as a +reversed gradient matching problem. For online evaluation, we expedite the +leave-one-out process using the synset, which is then utilized to compute the +attribution matrix based on the evaluation objective. Experimental evaluations, +including training data attribution and assessments of data quality, +demonstrate that our proposed method achieves comparable model behavior +evaluation while significantly speeding up the process compared to the direct +retraining method.",cs.LG,"['cs.LG', 'cs.CV']" +Memory-based Adapters for Online 3D Scene Perception,Xiuwei Xu · Chong Xia · Ziwei Wang · Linqing Zhao · Linqing Zhao · Yueqi Duan · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2403.06974,,2403.06974.pdf,Memory-based Adapters for Online 3D Scene Perception,"In this paper, we propose a new framework for online 3D scene perception. +Conventional 3D scene perception methods are offline, i.e., take an already +reconstructed 3D scene geometry as input, which is not applicable in robotic +applications where the input data is streaming RGB-D videos rather than a +complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with +online 3D scene perception tasks where data collection and perception should be +performed simultaneously, the model should be able to process 3D scenes frame +by frame and make use of the temporal information. To this end, we propose an +adapter-based plug-and-play module for the backbone of 3D scene perception +model, which constructs memory to cache and aggregate the extracted RGB-D +features to empower offline models with temporal learning ability. +Specifically, we propose a queued memory mechanism to cache the supporting +point cloud and image features. 
Then we devise aggregation modules which +directly perform on the memory and pass temporal information to current frame. +We further propose 3D-to-2D adapter to enhance image features with strong +global context. Our adapters can be easily inserted into mainstream offline +architectures of different tasks and significantly boost their performance on +online tasks. Extensive experiments on ScanNet and SceneNN datasets demonstrate +our approach achieves leading performance on three 3D scene perception tasks +compared with state-of-the-art online methods by simply finetuning existing +offline models, without any model and task-specific designs. +\href{https://xuxw98.github.io/Online3D/}{Project page}.",cs.CV,['cs.CV'] +Ungeneralizable Examples,Jingwen Ye · Xinchao Wang, ,https://arxiv.org/abs/2404.14016,,2404.14016.pdf,Ungeneralizable Examples,"The training of contemporary deep learning models heavily relies on publicly +available data, posing a risk of unauthorized access to online data and raising +concerns about data privacy. Current approaches to creating unlearnable data +involve incorporating small, specially designed noises, but these methods +strictly limit data usability, overlooking its potential usage in authorized +scenarios. In this paper, we extend the concept of unlearnable data to +conditional data learnability and introduce \textbf{U}n\textbf{G}eneralizable +\textbf{E}xamples (UGEs). UGEs exhibit learnability for authorized users while +maintaining unlearnability for potential hackers. The protector defines the +authorized network and optimizes UGEs to match the gradients of the original +data and its ungeneralizable version, ensuring learnability. To prevent +unauthorized learning, UGEs are trained by maximizing a designated distance +loss in a common feature space. Additionally, to further safeguard the +authorized side from potential attacks, we introduce additional undistillation +optimization. Experimental results on multiple datasets and various networks +demonstrate that the proposed UGEs framework preserves data usability while +reducing training performance on hacker networks, even under different types of +attacks.",cs.LG,"['cs.LG', 'cs.CV']" +ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion,Juncheng Mu · Lin Bie · Shaoyi Du · Yue Gao, ,,https://www.mdpi.com/2072-4292/16/5/743,,,,,nan +IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration,Tai Ma · zhangsuwei · Jiafeng Li · Ying Wen, ,https://arxiv.org/html/2312.13396v1,,2312.13396v1.pdf,EPNet: An Efficient Pyramid Network for Enhanced Single-Image Super-Resolution with Reduced Computational Requirements,"Single-image super-resolution (SISR) has seen significant advancements +through the integration of deep learning. However, the substantial +computational and memory requirements of existing methods often limit their +practical application. This paper introduces a new Efficient Pyramid Network +(EPNet) that harmoniously merges an Edge Split Pyramid Module (ESPM) with a +Panoramic Feature Extraction Module (PFEM) to overcome the limitations of +existing methods, particularly in terms of computational efficiency. The ESPM +applies a pyramid-based channel separation strategy, boosting feature +extraction while maintaining computational efficiency. The PFEM, a novel fusion +of CNN and Transformer structures, enables the concurrent extraction of local +and global features, thereby providing a panoramic view of the image landscape. 
+Our architecture integrates the PFEM in a manner that facilitates the +streamlined exchange of feature information and allows for the further +refinement of image texture details. Experimental results indicate that our +model outperforms existing state-of-the-art methods in image resolution +quality, while considerably decreasing computational and memory costs. This +research contributes to the ongoing evolution of efficient and practical SISR +methodologies, bearing broader implications for the field of computer vision.",cs.CV,['cs.CV'] +Towards Efficient Replay in Federated Incremental Learning,Yichen Li · Qunwei Li · Haozhao Wang · Ruixuan Li · Wenliang Zhong · Guannan Zhang, ,https://arxiv.org/abs/2403.05890,,2403.05890.pdf,Towards Efficient Replay in Federated Incremental Learning,"In Federated Learning (FL), the data in each client is typically assumed +fixed or static. However, data often comes in an incremental manner in +real-world applications, where the data domain may increase dynamically. In +this work, we study catastrophic forgetting with data heterogeneity in +Federated Incremental Learning (FIL) scenarios where edge clients may lack +enough storage space to retain full data. We propose to employ a simple, +generic framework for FIL named Re-Fed, which can coordinate each client to +cache important samples for replay. More specifically, when a new task arrives, +each client first caches selected previous samples based on their global and +local importance. Then, the client trains the local model with both the cached +samples and the samples from the new task. Theoretically, we analyze the +ability of Re-Fed to discover important samples for replay thus alleviating the +catastrophic forgetting problem. Moreover, we empirically show that Re-Fed +achieves competitive performance compared to state-of-the-art methods.",cs.LG,"['cs.LG', 'cs.DC']" +Disentangled Pre-training for Human-Object Interaction Detection,Zhuolong Li · Xingao Li · Changxing Ding · Xiangmin Xu,https://github.com/xingaoli/DP-HOI,https://arxiv.org/abs/2404.01725,,2404.01725.pdf,Disentangled Pre-training for Human-Object Interaction Detection,"Detecting human-object interaction (HOI) has long been limited by the amount +of supervised data available. Recent approaches address this issue by +pre-training according to pseudo-labels, which align object regions with HOI +triplets parsed from image captions. However, pseudo-labeling is tricky and +noisy, making HOI pre-training a complex process. Therefore, we propose an +efficient disentangled pre-training method for HOI detection (DP-HOI) to +address this problem. First, DP-HOI utilizes object detection and action +recognition datasets to pre-train the detection and interaction decoder layers, +respectively. Then, we arrange these decoder layers so that the pre-training +architecture is consistent with the downstream HOI detection task. This +facilitates efficient knowledge transfer. Specifically, the detection decoder +identifies reliable human instances in each action recognition dataset image, +generates one corresponding query, and feeds it into the interaction decoder +for verb classification. Next, we combine the human instance verb predictions +in the same image and impose image-level supervision. The DP-HOI structure can +be easily adapted to the HOI detection task, enabling effective model parameter +initialization. Therefore, it significantly enhances the performance of +existing HOI detection models on a broad range of rare categories. 
The code and +pre-trained weight are available at https://github.com/xingaoli/DP-HOI.",cs.CV,['cs.CV'] +RegionGPT: Towards Region Understanding Vision Language Model,Qiushan Guo · Shalini De Mello · Danny Yin · Wonmin Byeon · Ka Chun Cheung · Yizhou Yu · Ping Luo · Sifei Liu,https://guoqiushan.github.io/regiongpt.github.io/,https://arxiv.org/abs/2403.02330v1,,2403.02330v1.pdf,RegionGPT: Towards Region Understanding Vision Language Model,"Vision language models (VLMs) have experienced rapid advancements through the +integration of large language models (LLMs) with image-text pairs, yet they +struggle with detailed regional visual understanding due to limited spatial +awareness of the vision encoder, and the use of coarse-grained training data +that lacks detailed, region-specific captions. To address this, we introduce +RegionGPT (short as RGPT), a novel framework designed for complex region-level +captioning and understanding. RGPT enhances the spatial awareness of regional +representation with simple yet effective modifications to existing visual +encoders in VLMs. We further improve performance on tasks requiring a specific +output scope by integrating task-guided instruction prompts during both +training and inference phases, while maintaining the model's versatility for +general-purpose tasks. Additionally, we develop an automated region caption +data generation pipeline, enriching the training set with detailed region-level +captions. We demonstrate that a universal RGPT model can be effectively applied +and significantly enhancing performance across a range of region-level tasks, +including but not limited to complex region descriptions, reasoning, object +classification, and referring expressions comprehension.",cs.CV,['cs.CV'] +Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence,Ripon Saha · Dehao Qin · Nianyi Li · Jinwei Ye · Suren Jayasuriya, ,https://arxiv.org/abs/2404.13605,,2404.13605.pdf,Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence,"Tackling image degradation due to atmospheric turbulence, particularly in +dynamic environment, remains a challenge for long-range imaging systems. +Existing techniques have been primarily designed for static scenes or scenes +with small motion. This paper presents the first segment-then-restore pipeline +for restoring the videos of dynamic scenes in turbulent environment. We +leverage mean optical flow with an unsupervised motion segmentation method to +separate dynamic and static scene components prior to restoration. After camera +shake compensation and segmentation, we introduce foreground/background +enhancement leveraging the statistics of turbulence strength and a transformer +model trained on a novel noise-based procedural turbulence generator for fast +dataset augmentation. Benchmarked against existing restoration methods, our +approach restores most of the geometric distortion and enhances sharpness for +videos. 
We make our code, simulator, and data publicly available to advance the +field of video restoration from turbulence: riponcs.github.io/TurbSegRes",cs.CV,"['cs.CV', 'eess.IV']" +Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo,Zongrui Li · Zhan Lu · Haojie Yan · Boxin Shi · Gang Pan · Qian Zheng · Xudong Jiang, ,https://arxiv.org/abs/2404.01612,,2404.01612.pdf,Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo,"Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict +environment and light assumptions in classical Uncalibrated Photometric Stereo +(UPS) methods. However, due to the intrinsic ill-posedness and high-dimensional +ambiguities, addressing NaUPS is still an open question. Existing works impose +strong assumptions on the environment lights and objects' material, restricting +the effectiveness in more general scenarios. Alternatively, some methods +leverage supervised learning with intricate models while lacking +interpretability, resulting in a biased estimation. In this work, we proposed +Spin Light Uncalibrated Photometric Stereo (Spin-UP), an unsupervised method to +tackle NaUPS in various environment lights and objects. The proposed method +uses a novel setup that captures the object's images on a rotatable platform, +which mitigates NaUPS's ill-posedness by reducing unknowns and provides +reliable priors to alleviate NaUPS's ambiguities. Leveraging neural inverse +rendering and the proposed training strategies, Spin-UP recovers surface +normals, environment light, and isotropic reflectance under complex natural +light with low computational cost. Experiments have shown that Spin-UP +outperforms other supervised / unsupervised NaUPS methods and achieves +state-of-the-art performance on synthetic and real-world datasets. Codes and +data are available at https://github.com/LMozart/CVPR2024-SpinUP.",cs.CV,['cs.CV'] +Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training,Arun Reddy · William Paul · Corban Rivera · Ketul Shah · Celso M. de Melo · Rama Chellappa, ,https://arxiv.org/abs/2312.02914,,2312.02914.pdf,Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training,"In this work, we tackle the problem of unsupervised domain adaptation (UDA) +for video action recognition. Our approach, which we call UNITE, uses an image +teacher model to adapt a video student model to the target domain. UNITE first +employs self-supervised pre-training to promote discriminative feature learning +on target domain videos using a teacher-guided masked distillation objective. +We then perform self-training on masked target data, using the video student +model and image teacher model together to generate improved pseudolabels for +unlabeled target videos. Our self-training process successfully leverages the +strengths of both models to achieve strong transfer performance across domains. +We evaluate our approach on multiple video domain adaptation benchmarks and +observe significant improvements upon previously reported results.",cs.CV,"['cs.CV', 'cs.LG']" +Would Deep Generative Models Amplify Bias in Future Models?,Tianwei Chen · Yusuke Hirota · Mayu Otani · Noa Garcia · Yuta Nakashima, ,https://arxiv.org/abs/2404.03242,,2404.03242.pdf,Would Deep Generative Models Amplify Bias in Future Models?,"We investigate the impact of deep generative models on potential social +biases in upcoming computer vision models. 
As the internet witnesses an +increasing influx of AI-generated images, concerns arise regarding inherent +biases that may accompany them, potentially leading to the dissemination of +harmful content. This paper explores whether a detrimental feedback loop, +resulting in bias amplification, would occur if generated images were used as +the training data for future models. We conduct simulations by progressively +substituting original images in COCO and CC3M datasets with images generated +through Stable Diffusion. The modified datasets are used to train OpenCLIP and +image captioning models, which we evaluate in terms of quality and bias. +Contrary to expectations, our findings indicate that introducing generated +images during training does not uniformly amplify bias. Instead, instances of +bias mitigation across specific tasks are observed. We further explore the +factors that may influence these phenomena, such as artifacts in image +generation (e.g., blurry faces) or pre-existing biases in the original +datasets.",cs.CV,['cs.CV'] +Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models,Xingqian Xu · Jiayi Guo · Zhangyang Wang · Gao Huang · Irfan Essa · Humphrey Shi, ,,https://openreview.net/forum?id=QL3Zuth6E7,,,,,nan +Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models,Jiayi Guo · Xingqian Xu · Yifan Pu · Zanlin Ni · Chaofei Wang · Manushree Vasu · Shiji Song · Gao Huang · Humphrey Shi,https://shi-labs.github.io/Smooth-Diffusion/,https://arxiv.org/abs/2312.04410,,2312.04410.pdf,Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models,"Recently, diffusion models have made remarkable progress in text-to-image +(T2I) generation, synthesizing images with high fidelity and diverse contents. +Despite this advancement, latent space smoothness within diffusion models +remains largely unexplored. Smooth latent spaces ensure that a perturbation on +an input latent corresponds to a steady change in the output image. This +property proves beneficial in downstream tasks, including image interpolation, +inversion, and editing. In this work, we expose the non-smoothness of diffusion +latent spaces by observing noticeable visual fluctuations resulting from minor +latent variations. To tackle this issue, we propose Smooth Diffusion, a new +category of diffusion models that can be simultaneously high-performing and +smooth. Specifically, we introduce Step-wise Variation Regularization to +enforce the proportion between the variations of an arbitrary input latent and +that of the output image is a constant at any diffusion training step. In +addition, we devise an interpolation standard deviation (ISTD) metric to +effectively assess the latent space smoothness of a diffusion model. Extensive +quantitative and qualitative experiments demonstrate that Smooth Diffusion +stands out as a more desirable solution not only in T2I generation but also +across various downstream tasks. Smooth Diffusion is implemented as a +plug-and-play Smooth-LoRA to work with various community models. 
Code is +available at https://github.com/SHI-Labs/Smooth-Diffusion.",cs.CV,['cs.CV'] +PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor,Vidit Goel · Elia Peruzzo · Yifan Jiang · Dejia Xu · Xingqian Xu · Nicu Sebe · Trevor Darrell · Zhangyang Wang · Humphrey Shi,https://vidit98.github.io/publication/conference-paper/pair_diff.html,,https://openreview.net/forum?id=cI5j8tEPNU,,,,,nan +Large Language Models are Good Prompt Learners for Low-Shot Image Classification,Zhaoheng Zheng · Jingmin Wei · Xuefeng Hu · Haidong Zhu · Ram Nevatia, ,https://arxiv.org/abs/2312.04076,,2312.04076.pdf,Large Language Models are Good Prompt Learners for Low-Shot Image Classification,"Low-shot image classification, where training images are limited or +inaccessible, has benefited from recent progress on pre-trained vision-language +(VL) models with strong generalizability, e.g. CLIP. Prompt learning methods +built with VL models generate text features from the class names that only have +confined class-specific information. Large Language Models (LLMs), with their +vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we +discuss the integration of LLMs to enhance pre-trained VL models, specifically +on low-shot classification. However, the domain gap between language and vision +blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language +Models as Prompt learners, that produces adaptive prompts for the CLIP text +encoder, establishing it as the connecting bridge. Experiments show that, +compared with other state-of-the-art prompt learning methods, LLaMP yields +better performance on both zero-shot generalization and few-shot image +classification, over a spectrum of 11 datasets. Code will be made available at: +https://github.com/zhaohengz/LLaMP.",cs.CV,['cs.CV'] +SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream,Lin Zhu · Kangmin Jia · Yifan Zhao · Yunshan Qi · Lizhi Wang · Hua Huang, ,https://arxiv.org/abs/2403.11222,,2403.11222.pdf,SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream,"Spike cameras, leveraging spike-based integration sampling and high temporal +resolution, offer distinct advantages over standard cameras. However, existing +approaches reliant on spike cameras often assume optimal illumination, a +condition frequently unmet in real-world scenarios. To address this, we +introduce SpikeNeRF, the first work that derives a NeRF-based volumetric scene +representation from spike camera data. Our approach leverages NeRF's multi-view +consistency to establish robust self-supervision, effectively eliminating +erroneous measurements and uncovering coherent structures within exceedingly +noisy input amidst diverse real-world illumination scenarios. The framework +comprises two core elements: a spike generation model incorporating an +integrate-and-fire neuron layer and parameters accounting for non-idealities, +such as threshold variation, and a spike rendering loss capable of generalizing +across varying illumination conditions. We describe how to effectively optimize +neural radiance fields to render photorealistic novel views from the novel +continuous spike stream, demonstrating advantages over other vision sensors in +certain scenes. Empirical evaluations conducted on both real and novel +realistically simulated sequences affirm the efficacy of our methodology. 
The +dataset and source code are released at +https://github.com/BIT-Vision/SpikeNeRF.",cs.CV,['cs.CV'] +UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement,yaofeng xie · Lingwei Kong · Kai Chen · Zheng Ziqiang · Xiao Yu · Zhibin Yu · Bing Zheng,https://github.com/yzbouc/UVEB,https://arxiv.org/abs/2404.14542,,2404.14542.pdf,UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement,"Learning-based underwater image enhancement (UIE) methods have made great +progress. However, the lack of large-scale and high-quality paired training +samples has become the main bottleneck hindering the development of UIE. The +inter-frame information in underwater videos can accelerate or optimize the UIE +process. Thus, we constructed the first large-scale high-resolution underwater +video enhancement benchmark (UVEB) to promote the development of underwater +vision.It contains 1,308 pairs of video sequences and more than 453,000 +high-resolution with 38\% Ultra-High-Definition (UHD) 4K frame pairs. UVEB +comes from multiple countries, containing various scenes and video degradation +types to adapt to diverse and complex underwater environments. We also propose +the first supervised underwater video enhancement method, UVE-Net. UVE-Net +converts the current frame information into convolutional kernels and passes +them to adjacent frames for efficient inter-frame information exchange. By +fully utilizing the redundant degraded information of underwater videos, +UVE-Net completes video enhancement better. Experiments show the effective +network design and good performance of UVE-Net.",cs.CV,"['cs.CV', 'I.4']" +Single-View Scene Point Cloud Human Grasp Generation,Yan-Kang Wang · Chengyi Xing · Yi-Lin Wei · Xiao-Ming Wu · Wei-Shi Zheng, ,https://arxiv.org/abs/2404.15815,,2404.15815.pdf,Single-View Scene Point Cloud Human Grasp Generation,"In this work, we explore a novel task of generating human grasps based on +single-view scene point clouds, which more accurately mirrors the typical +real-world situation of observing objects from a single viewpoint. Due to the +incompleteness of object point clouds and the presence of numerous scene +points, the generated hand is prone to penetrating into the invisible parts of +the object and the model is easily affected by scene points. Thus, we introduce +S2HGrasp, a framework composed of two key modules: the Global Perception module +that globally perceives partial object point clouds, and the DiffuGrasp module +designed to generate high-quality human grasps based on complex inputs that +include scene points. Additionally, we introduce S2HGD dataset, which comprises +approximately 99,000 single-object single-view scene point clouds of 1,668 +unique objects, each annotated with one human grasp. Our extensive experiments +demonstrate that S2HGrasp can not only generate natural human grasps regardless +of scene points, but also effectively prevent penetration between the hand and +invisible parts of the object. Moreover, our model showcases strong +generalization capability when applied to unseen objects. 
Our code and dataset +are available at https://github.com/iSEE-Laboratory/S2HGrasp.",cs.CV,['cs.CV'] +MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception,Yiran Qin · Enshen Zhou · Qichang Liu · Zhenfei Yin · Lu Sheng · Ruimao Zhang · Yu Qiao · Jing Shao,https://iranqin.github.io/MP5.github.io/,https://arxiv.org/abs/2312.07472,,2312.07472.pdf,MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception,"It is a long-lasting goal to design an embodied system that can solve +long-horizon open-world tasks in human-like ways. However, existing approaches +usually struggle with compound difficulties caused by the logic-aware +decomposition and context-aware execution of these tasks. To this end, we +introduce MP5, an open-ended multimodal embodied system built upon the +challenging Minecraft simulator, which can decompose feasible sub-objectives, +design sophisticated situation-aware plans, and perform embodied action +control, with frequent communication with a goal-conditioned active perception +scheme. Specifically, MP5 is developed on top of recent advances in Multimodal +Large Language Models (MLLMs), and the system is modulated into functional +modules that can be scheduled and collaborated to ultimately solve pre-defined +context- and process-dependent tasks. Extensive experiments prove that MP5 can +achieve a 22% success rate on difficult process-dependent tasks and a 91% +success rate on tasks that heavily depend on the context. Moreover, MP5 +exhibits a remarkable ability to address many open-ended tasks that are +entirely novel.",cs.CV,['cs.CV'] +Scaling Up Video Summarization Pretraining with Large Language Models,Dawit Argaw Argaw · Seunghyun Yoon · Fabian Caba Heilbron · Hanieh Deilamsalehy · Trung Bui · Zhaowen Wang · Franck Dernoncourt · Joon Chung, ,https://arxiv.org/abs/2404.03398,,2404.03398.pdf,Scaling Up Video Summarization Pretraining with Large Language Models,"Long-form video content constitutes a significant portion of internet +traffic, making automated video summarization an essential research problem. +However, existing video summarization datasets are notably limited in their +size, constraining the effectiveness of state-of-the-art methods for +generalization. Our work aims to overcome this limitation by capitalizing on +the abundance of long-form videos with dense speech-to-video alignment and the +remarkable capabilities of recent large language models (LLMs) in summarizing +long text. We introduce an automated and scalable pipeline for generating a +large-scale video summarization dataset using LLMs as Oracle summarizers. By +leveraging the generated dataset, we analyze the limitations of existing +approaches and propose a new video summarization model that effectively +addresses them. To facilitate further research in the field, our work also +presents a new benchmark dataset that contains 1200 long videos each with +high-quality summaries annotated by professionals. 
Extensive experiments +clearly indicate that our proposed approach sets a new state-of-the-art in +video summarization across several benchmarks.",cs.CV,['cs.CV'] +CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification,Haoran Lai · Qingsong Yao · Zihang Jiang · Rongsheng Wang · Zhiyang He · Xiaodong Tao · S Kevin Zhou, ,https://arxiv.org/abs/2402.17417,,2402.17417.pdf,CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification,"The advancement of Zero-Shot Learning in the medical domain has been driven +forward by using pre-trained models on large-scale image-text pairs, focusing +on image-text alignment. However, existing methods primarily rely on cosine +similarity for alignment, which may not fully capture the complex relationship +between medical images and reports. To address this gap, we introduce a novel +approach called Cross-Attention Alignment for Radiology Zero-Shot +Classification (CARZero). Our approach innovatively leverages cross-attention +mechanisms to process image and report features, creating a Similarity +Representation that more accurately reflects the intricate relationships in +medical semantics. This representation is then linearly projected to form an +image-text similarity matrix for cross-modality alignment. Additionally, +recognizing the pivotal role of prompt selection in zero-shot learning, CARZero +incorporates a Large Language Model-based prompt alignment strategy. This +strategy standardizes diverse diagnostic expressions into a unified format for +both training and inference phases, overcoming the challenges of manual prompt +design. Our approach is simple yet effective, demonstrating state-of-the-art +performance in zero-shot classification on five official chest radiograph +diagnostic test sets, including remarkable results on datasets with long-tail +distributions of rare diseases. This achievement is attributed to our new +image-text alignment strategy, which effectively addresses the complex +relationship between medical images and reports. Code and models are available +at https://github.com/laihaoran/CARZero.",cs.CV,['cs.CV'] +LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes,Yanwen Guo · Yuanqi Li · Dayong Ren · Xiaohong Zhang · Jiawei Li · Liang Pu · Changfeng Ma · xiaoyu zhan · Jie Guo · Mingqiang Wei · Yan Zhang · Piaopiao Yu · Shuangyu Yang · Donghao Ji · Huisheng Ye · Hao Sun · Yansong Liu · Yinuo Chen · Jiaqi Zhu · Hongyu Liu, ,https://arxiv.org/html/2309.13596v2,,2309.13596v2.pdf,Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model Development,"Advanced Driver-Assistance Systems (ADAS) have successfully integrated +learning-based techniques into vehicle perception and decision-making. However, +their application in 3D lane detection for effective driving environment +perception is hindered by the lack of comprehensive LiDAR datasets. The sparse +nature of LiDAR point cloud data prevents an efficient manual annotation +process. To solve this problem, we present LiSV-3DLane, a large-scale 3D lane +dataset that comprises 20k frames of surround-view LiDAR point clouds with +enriched semantic annotation. Unlike existing datasets confined to a frontal +perspective, LiSV-3DLane provides a full 360-degree spatial panorama around the +ego vehicle, capturing complex lane patterns in both urban and highway +environments. 
We leverage the geometric traits of lane lines and the intrinsic +spatial attributes of LiDAR data to design a simple yet effective automatic +annotation pipeline for generating finer lane labels. To propel future +research, we propose a novel LiDAR-based 3D lane detection model, LiLaDet, +incorporating the spatial geometry learning of the LiDAR point cloud into +Bird's Eye View (BEV) based lane identification. Experimental results indicate +that LiLaDet outperforms existing camera- and LiDAR-based approaches in the 3D +lane detection task on the K-Lane dataset and our LiSV-3DLane.",cs.CV,['cs.CV'] +Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection,Chuangchuang Tan · Huan Liu · Yao Zhao · Shikui Wei · Guanghua Gu · Ping Liu · Yunchao Wei,https://github.com/chuangchuangtan/NPR-DeepfakeDetection,https://arxiv.org/abs/2312.10461,,2312.10461.pdf,Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection,"Recently, the proliferation of highly realistic synthetic images, facilitated +through a variety of GANs and Diffusions, has significantly heightened the +susceptibility to misuse. While the primary focus of deepfake detection has +traditionally centered on the design of detection algorithms, an investigative +inquiry into the generator architectures has remained conspicuously absent in +recent years. This paper contributes to this lacuna by rethinking the +architectures of CNN-based generators, thereby establishing a generalized +representation of synthetic artifacts. Our findings illuminate that the +up-sampling operator can, beyond frequency-based artifacts, produce generalized +forgery artifacts. In particular, the local interdependence among image pixels +caused by upsampling operators is significantly demonstrated in synthetic +images generated by GAN or diffusion. Building upon this observation, we +introduce the concept of Neighboring Pixel Relationships(NPR) as a means to +capture and characterize the generalized structural artifacts stemming from +up-sampling operations. A comprehensive analysis is conducted on an open-world +dataset, comprising samples generated by \tft{28 distinct generative models}. +This analysis culminates in the establishment of a novel state-of-the-art +performance, showcasing a remarkable \tft{11.6\%} improvement over existing +methods. The code is available at +https://github.com/chuangchuangtan/NPR-DeepfakeDetection.",cs.CV,['cs.CV'] +SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection,JUNSU KIM · Hoseong Cho · Jihyeon Kim · Yihalem Tiruneh · Seungryul Baek, ,https://arxiv.org/abs/2402.17323,,2402.17323.pdf,SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection,"In the field of class incremental learning (CIL), generative replay has +become increasingly prominent as a method to mitigate the catastrophic +forgetting, alongside the continuous improvements in generative models. +However, its application in class incremental object detection (CIOD) has been +significantly limited, primarily due to the complexities of scenes involving +multiple labels. In this paper, we propose a novel approach called stable +diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a +diffusion-based generative model with pre-trained text-to-diffusion networks to +generate realistic and diverse synthetic images. 
SDDGR incorporates an +iterative refinement strategy to produce high-quality images encompassing old +classes. Additionally, we adopt an L2 knowledge distillation technique to +improve the retention of prior knowledge in synthetic images. Furthermore, our +approach includes pseudo-labeling for old objects within new task images, +preventing misclassification as background elements. Extensive experiments on +the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing +algorithms, achieving a new state-of-the-art in various CIOD scenarios. The +source code will be made available to the public.",cs.CV,['cs.CV'] +Mean-Shift Feature Transformer,Takumi Kobayashi, ,https://arxiv.org/abs/2404.11062,,2404.11062.pdf,Generation of a precise time scale assisted by a near-continuously operating optical lattice clock,"We report on a reduced time variation of a time scale with respect to +Coordinated Universal Time (UTC) by steering a hydrogen-maser-based time scale +with a near-continuously operating optical lattice clock. The time scale is +generated in a post-processing analysis for 230 days with a hydrogen maser with +its fractional frequency stability limited by a flicker floor of +$2\times10^{-15}$ and an Yb optical lattice clock operated with an uptime of +81.6 $\%$. During the 230-day period, the root mean square time variation of +our time scale with respect to UTC is 0.52 ns, which is a better performance +compared with those of time scales steered by microwave fountain clocks that +exhibit root mean square variations from 0.99 ns to 1.6 ns. With the high +uptime achieved by the Yb optical lattice clock, our simulation implies the +potential of generating a state-of-the-art time scale with a time variation of +$<0.1$ ns over a month using a better hydrogen maser reaching the mid +$10^{-16}$ level. This work demonstrates that a use of an optical clock with a +high uptime enhances the stability of a time scale.",physics.atom-ph,['physics.atom-ph'] +TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding,Zhihao Zhang · Shengcao Cao · Yu-Xiong Wang, ,https://arxiv.org/abs/2402.18490,,2402.18490.pdf,TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding,"The limited scale of current 3D shape datasets hinders the advancements in 3D +shape understanding, and motivates multi-modal learning approaches which +transfer learned knowledge from data-abundant 2D image and language modalities +to 3D shapes. However, even though the image and language representations have +been aligned by cross-modal models like CLIP, we find that the image modality +fails to contribute as much as the language in existing multi-modal 3D +representation learning methods. This is attributed to the domain shift in the +2D images and the distinct focus of each modality. To more effectively leverage +both modalities in the pre-training, we introduce TriAdapter Multi-Modal +Learning (TAMM) -- a novel two-stage learning approach based on three +synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap +between 3D-rendered images and natural images, by adapting the visual +representations of CLIP for synthetic image-text pairs. Subsequently, our Dual +Adapters decouple the 3D shape representation space into two complementary +sub-spaces: one focusing on visual attributes and the other for semantic +understanding, which ensure a more comprehensive and effective multi-modal +pre-training. 
Extensive experiments demonstrate that TAMM consistently enhances +3D representations for a wide range of 3D encoder architectures, pre-training +datasets, and downstream tasks. Notably, we boost the zero-shot classification +accuracy on Objaverse-LVIS from 46.8\% to 50.7\%, and improve the 5-way 10-shot +linear probing classification accuracy on ModelNet40 from 96.1\% to 99.0\%. +Project page: https://alanzhangcs.github.io/tamm-page.",cs.CV,['cs.CV'] +Open-Vocabulary 3D Semantic Segmentation with Foundation Models,Li Jiang · Shaoshuai Shi · Bernt Schiele, ,https://arxiv.org/abs/2306.13631,,2306.13631.pdf,OpenMask3D: Open-Vocabulary 3D Instance Segmentation,"We introduce the task of open-vocabulary 3D instance segmentation. Current +approaches for 3D instance segmentation can typically only recognize object +categories from a pre-defined closed set of classes that are annotated in the +training datasets. This results in important limitations for real-world +applications where one might need to perform tasks guided by novel, +open-vocabulary queries related to a wide variety of objects. Recently, +open-vocabulary 3D scene understanding methods have emerged to address this +problem by learning queryable features for each point in the scene. While such +a representation can be directly employed to perform semantic segmentation, +existing methods cannot separate multiple object instances. In this work, we +address this limitation, and propose OpenMask3D, which is a zero-shot approach +for open-vocabulary 3D instance segmentation. Guided by predicted +class-agnostic 3D instance masks, our model aggregates per-mask features via +multi-view fusion of CLIP-based image embeddings. Experiments and ablation +studies on ScanNet200 and Replica show that OpenMask3D outperforms other +open-vocabulary methods, especially on the long-tail distribution. Qualitative +experiments further showcase OpenMask3D's ability to segment object properties +based on free-form queries describing geometry, affordances, and materials.",cs.CV,['cs.CV'] +Multiplane Prior Guided Few-Shot Aerial Scene Rendering,Zihan Gao · Licheng Jiao · Lingling Li · Xu Liu · Fang Liu · Puhua Chen · Yuwei Guo, ,http://export.arxiv.org/abs/2402.16407,,2402.16407.pdf,CMC: Few-shot Novel View Synthesis via Cross-view Multiplane Consistency,"Neural Radiance Field (NeRF) has shown impressive results in novel view +synthesis, particularly in Virtual Reality (VR) and Augmented Reality (AR), +thanks to its ability to represent scenes continuously. However, when just a +few input view images are available, NeRF tends to overfit the given views and +thus make the estimated depths of pixels share almost the same value. Unlike +previous methods that conduct regularization by introducing complex priors or +additional supervisions, we propose a simple yet effective method that +explicitly builds depth-aware consistency across input views to tackle this +challenge. Our key insight is that by forcing the same spatial points to be +sampled repeatedly in different input views, we are able to strengthen the +interactions between views and therefore alleviate the overfitting problem. To +achieve this, we build the neural networks on layered representations +(\textit{i.e.}, multiplane images), and the sampling point can thus be +resampled on multiple discrete planes. Furthermore, to regularize the unseen +target views, we constrain the rendered colors and depths from different input +views to be the same. 
Although simple, extensive experiments demonstrate that +our proposed method can achieve better synthesis quality over state-of-the-art +methods.",cs.CV,"['cs.CV', 'cs.GR']" +One-step Diffusion with Distribution Matching Distillation,Tianwei Yin · Michaël Gharbi · Michaël Gharbi · Richard Zhang · Eli Shechtman · Fredo Durand · William Freeman · Taesung Park, ,https://arxiv.org/abs/2311.18828,,2311.18828.pdf,One-step Diffusion with Distribution Matching Distillation,"Diffusion models generate high-quality images but require dozens of forward +passes. We introduce Distribution Matching Distillation (DMD), a procedure to +transform a diffusion model into a one-step image generator with minimal impact +on image quality. We enforce the one-step image generator match the diffusion +model at distribution level, by minimizing an approximate KL divergence whose +gradient can be expressed as the difference between 2 score functions, one of +the target distribution and the other of the synthetic distribution being +produced by our one-step generator. The score functions are parameterized as +two diffusion models trained separately on each distribution. Combined with a +simple regression loss matching the large-scale structure of the multi-step +diffusion outputs, our method outperforms all published few-step diffusion +approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot +COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. +Utilizing FP16 inference, our model generates images at 20 FPS on modern +hardware.",cs.CV,['cs.CV'] +Towards 3D Vision with Low-Cost Single-Photon Cameras,Fangzhou Mu · Carter Sifferman · Sacha Jungerman · Yiquan Li · Zhiyue Han · Michael Gleicher · Mohit Gupta · Yin Li,https://cpsiff.github.io/towards_3d_vision/,https://arxiv.org/abs/2403.17801,,2403.17801.pdf,Towards 3D Vision with Low-Cost Single-Photon Cameras,"We present a method for reconstructing 3D shape of arbitrary Lambertian +objects based on measurements by miniature, energy-efficient, low-cost +single-photon cameras. These cameras, operating as time resolved image sensors, +illuminate the scene with a very fast pulse of diffuse light and record the +shape of that pulse as it returns back from the scene at a high temporal +resolution. We propose to model this image formation process, account for its +non-idealities, and adapt neural rendering to reconstruct 3D geometry from a +set of spatially distributed sensors with known poses. We show that our +approach can successfully recover complex 3D shapes from simulated data. We +further demonstrate 3D object reconstruction from real-world captures, +utilizing measurements from a commodity proximity sensor. 
Our work draws a +connection between image-based modeling and active range scanning and is a step +towards 3D vision with single-photon cameras.",cs.CV,"['cs.CV', 'eess.IV']" +RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation,Yi Rong · Haoran Zhou · Kang Xia · Cheng Mei · Jiahao Wang · Tong Lu, ,,https://www.mdpi.com/2072-4292/16/3/450,,,,,nan +UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence,Ruihai Wu · Haoran Lu · Yiyan Wang · Yubo Wang · Hao Dong, ,https://arxiv.org/abs/2405.06903,,2405.06903.pdf,UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence,"Garment manipulation (e.g., unfolding, folding and hanging clothes) is +essential for future robots to accomplish home-assistant tasks, while highly +challenging due to the diversity of garment configurations, geometries and +deformations. Although able to manipulate similar shaped garments in a certain +task, previous works mostly have to design different policies for different +tasks, could not generalize to garments with diverse geometries, and often rely +heavily on human-annotated data. In this paper, we leverage the property that, +garments in a certain category have similar structures, and then learn the +topological dense (point-level) visual correspondence among garments in the +category level with different deformations in the self-supervised manner. The +topological correspondence can be easily adapted to the functional +correspondence to guide the manipulation policies for various downstream tasks, +within only one or few-shot demonstrations. Experiments over garments in 3 +different categories on 3 representative tasks in diverse scenarios, using one +or two arms, taking one or more steps, inputting flat or messy garments, +demonstrate the effectiveness of our proposed method. Project page: +https://warshallrho.github.io/unigarmentmanip.",cs.CV,['cs.CV'] +Learning Diffusion Texture Priors for Image Restoration,Tian Ye · Sixiang Chen · Wenhao Chai · Zhaohu Xing · Jing Qin · Ge lin · Lei Zhu, ,https://arxiv.org/abs/2312.08606,,2312.08606.pdf,VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook,"Night photography often struggles with challenges like low light and +blurring, stemming from dark environments and prolonged exposures. Current +methods either disregard priors and directly fitting end-to-end networks, +leading to inconsistent illumination, or rely on unreliable handcrafted priors +to constrain the network, thereby bringing the greater error to the final +result. We believe in the strength of data-driven high-quality priors and +strive to offer a reliable and consistent prior, circumventing the restrictions +of manual priors. In this paper, we propose Clearer Night Image Restoration +with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent +restoration outcomes on real-world and synthetic benchmarks. To ensure the +faithful restoration of details and illumination, we propose the incorporation +of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) +and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM +leverages the inter-channel correlation of features to dynamically maintain +illumination consistency between degraded features and high-quality codebook +features. 
Meanwhile, the DBCA module effectively integrates texture and +structural information through bi-directional cross-attention and deformable +convolution, resulting in enhanced fine-grained detail and structural fidelity +across parallel decoders. Extensive experiments validate the remarkable +benefits of VQCNIR in enhancing image quality under low-light conditions, +showcasing its state-of-the-art performance on both synthetic and real-world +datasets. The code is available at https://github.com/AlexZou14/VQCNIR.",cs.CV,['cs.CV'] +Move Anything with Layered Scene Diffusion,Jiawei Ren · Mengmeng Xu · Jui-Chieh Wu · Ziwei Liu · Tao Xiang · Antoine Toisoul, ,https://arxiv.org/abs/2404.07178,,2404.07178.pdf,Move Anything with Layered Scene Diffusion,"Diffusion models generate images with an unprecedented level of quality, but +how can we freely rearrange image layouts? Recent works generate controllable +scenes via learning spatially disentangled latent codes, but these methods do +not apply to diffusion models due to their fixed forward process. In this work, +we propose SceneDiffusion to optimize a layered scene representation during the +diffusion sampling process. Our key insight is that spatial disentanglement can +be obtained by jointly denoising scene renderings at different spatial layouts. +Our generated scenes support a wide range of spatial editing operations, +including moving, resizing, cloning, and layer-wise appearance editing +operations, including object restyling and replacing. Moreover, a scene can be +generated conditioned on a reference image, thus enabling object moving for +in-the-wild images. Notably, this approach is training-free, compatible with +general text-to-image diffusion models, and responsive in less than a second.",cs.CV,['cs.CV'] +MoML: Online Meta Adaptation for 3D Human Motion Prediction,Xiaoning Sun · Huaijiang Sun · Bin Li · Dong Wei · Weiqing Li · Jianfeng Lu, ,https://arxiv.org/abs/2405.02911,,,Multimodal Sense-Informed Prediction of 3D Human Motions,"Predicting future human pose is a fundamental application for machine +intelligence, which drives robots to plan their behavior and paths ahead of +time to seamlessly accomplish human-robot collaboration in real-world 3D +scenarios. Despite encouraging results, existing approaches rarely consider the +effects of the external scene on the motion sequence, leading to pronounced +artifacts and physical implausibilities in the predictions. To address this +limitation, this work introduces a novel multi-modal sense-informed motion +prediction approach, which conditions high-fidelity generation on two modal +information: external 3D scene, and internal human gaze, and is able to +recognize their salience for future human activity. Furthermore, the gaze +information is regarded as the human intention, and combined with both motion +and scene features, we construct a ternary intention-aware attention to +supervise the generation to match where the human wants to reach. Meanwhile, we +introduce semantic coherence-aware attention to explicitly distinguish the +salient point clouds and the underlying ones, to ensure a reasonable +interaction of the generated sequence with the 3D scene. On two real-world +benchmarks, the proposed method achieves state-of-the-art performance both in +3D human pose and trajectory prediction.",cs.CV,['cs.CV'] +Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark,Ziyang Chen · Israel D. 
Gebru · Christian Richardt · Anurag Kumar · William Laney · Andrew Owens · Alexander Richard, ,,https://openreview.net/forum?id=Mk0Uf3zHtU,,,,,nan +Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling,Shentong Mo · Pedro Morgado, ,https://arxiv.org/abs/2312.01017,,2312.01017.pdf,Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling,"Humans possess a remarkable ability to integrate auditory and visual +information, enabling a deeper understanding of the surrounding environment. +This early fusion of audio and visual cues, demonstrated through cognitive +psychology and neuroscience research, offers promising potential for developing +multimodal perception models. However, training early fusion architectures +poses significant challenges, as the increased model expressivity requires +robust learning frameworks to harness their enhanced capabilities. In this +paper, we address this challenge by leveraging the masked reconstruction +framework, previously successful in unimodal settings, to train audio-visual +encoders with early fusion. Additionally, we propose an attention-based fusion +module that captures interactions between local audio and visual +representations, enhancing the model's ability to capture fine-grained +interactions. While effective, this procedure can become computationally +intractable, as the number of local representations increases. Thus, to address +the computational complexity, we propose an alternative procedure that +factorizes the local representations before representing audio-visual +interactions. Extensive evaluations on a variety of datasets demonstrate the +superiority of our approach in audio-event classification, visual sound +localization, sound separation, and audio-visual segmentation. These +contributions enable the efficient training of deeply integrated audio-visual +models and significantly advance the usefulness of early fusion architectures.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM', 'cs.SD']" +HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses,Caoyuan Ma · Yu-Lun Liu · Zhixiang Wang · Wu Liu · Xinchen Liu · Zheng Wang, ,https://arxiv.org/abs/2312.02232,,2312.02232.pdf,HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses,"We present HumanNeRF-SE, a simple yet effective method that synthesizes +diverse novel pose images with simple input. Previous HumanNeRF works require a +large number of optimizable parameters to fit the human images. Instead, we +reload these approaches by combining explicit and implicit human +representations to design both generalized rigid deformation and specific +non-rigid deformation. Our key insight is that explicit shape can reduce the +sampling points used to fit implicit representation, and frozen blending +weights from SMPL constructing a generalized rigid deformation can effectively +avoid overfitting and improve pose generalization performance. Our architecture +involving both explicit and implicit representation is simple yet effective. +Experiments demonstrate our model can synthesize images under arbitrary poses +with few-shot input and increase the speed of synthesizing images by 15 times +through a reduction in computational complexity without using any existing +acceleration modules. 
Compared to the state-of-the-art HumanNeRF studies, +HumanNeRF-SE achieves better performance with fewer learnable parameters and +less training time.",cs.CV,['cs.CV'] +GALA: Generating Animatable Layered Assets from a Single Scan,Taeksoo Kim · Byungjun Kim · Shunsuke Saito · Hanbyul Joo, ,https://arxiv.org/abs/2401.12979,,2401.12979.pdf,GALA: Generating Animatable Layered Assets from a Single Scan,"We present GALA, a framework that takes as input a single-layer clothed 3D +human mesh and decomposes it into complete multi-layered 3D assets. The outputs +can then be combined with other assets to create novel clothed human avatars +with any pose. Existing reconstruction approaches often treat clothed humans as +a single-layer of geometry and overlook the inherent compositionality of humans +with hairstyles, clothing, and accessories, thereby limiting the utility of the +meshes for downstream applications. Decomposing a single-layer mesh into +separate layers is a challenging task because it requires the synthesis of +plausible geometry and texture for the severely occluded regions. Moreover, +even with successful decomposition, meshes are not normalized in terms of poses +and body shapes, failing coherent composition with novel identities and poses. +To address these challenges, we propose to leverage the general knowledge of a +pretrained 2D diffusion model as geometry and appearance prior for humans and +other assets. We first separate the input mesh using the 3D surface +segmentation extracted from multi-view 2D segmentations. Then we synthesize the +missing geometry of different layers in both posed and canonical spaces using a +novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete +inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its +texture to obtain the complete appearance including the initially occluded +regions. Through a series of decomposition steps, we obtain multiple layers of +3D assets in a shared canonical space normalized in terms of poses and human +shapes, hence supporting effortless composition to novel identities and +reanimation with novel poses. Our experiments demonstrate the effectiveness of +our approach for decomposition, canonicalization, and composition tasks +compared to existing solutions.",cs.CV,['cs.CV'] +A Vision Check-up for Language Models,Pratyusha Sharma · Tamar Rott Shaham · Manel Baradad · Stephanie Fu · Adrian Rodriguez-Munoz · Shivam Duggal · Phillip Isola · Antonio Torralba, ,https://arxiv.org/abs/2401.01862,,2401.01862.pdf,A Vision Check-up for Language Models,"What does learning to model relationships between strings teach large +language models (LLMs) about the visual world? We systematically evaluate LLMs' +abilities to generate and recognize an assortment of visual concepts of +increasing complexity and then demonstrate how a preliminary visual +representation learning system can be trained using models of text. As language +models lack the ability to consume or output visual information as pixels, we +use code to represent images in our study. Although LLM-generated images do not +look like natural images, results on image generation and the ability of models +to correct these generated images indicate that precise modeling of strings can +teach language models about numerous aspects of the visual world. 
Furthermore, +experiments on self-supervised visual representation learning, utilizing images +generated with text models, highlight the potential to train vision models +capable of making semantic assessments of natural images using just LLMs.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Boosting Adversarial Transferability by Block Shuffle and Rotation,Kunyu Wang · he xuanran · Wenxuan Wang · Xiaosen Wang, ,https://arxiv.org/abs/2308.10299,,2308.10299.pdf,Boosting Adversarial Transferability by Block Shuffle and Rotation,"Adversarial examples mislead deep neural networks with imperceptible +perturbations and have brought significant threats to deep learning. An +important aspect is their transferability, which refers to their ability to +deceive other models, thus enabling attacks in the black-box setting. Though +various methods have been proposed to boost transferability, the performance +still falls short compared with white-box attacks. In this work, we observe +that existing input transformation based attacks, one of the mainstream +transfer-based attacks, result in different attention heatmaps on various +models, which might limit the transferability. We also find that breaking the +intrinsic relation of the image can disrupt the attention heatmap of the +original image. Based on this finding, we propose a novel input transformation +based attack called block shuffle and rotation (BSR). Specifically, BSR splits +the input image into several blocks, then randomly shuffles and rotates these +blocks to construct a set of new images for gradient calculation. Empirical +evaluations on the ImageNet dataset demonstrate that BSR could achieve +significantly better transferability than the existing input transformation +based methods under single-model and ensemble-model settings. Combining BSR +with the current input transformation method can further improve the +transferability, which significantly outperforms the state-of-the-art methods. +Code is available at https://github.com/Trustworthy-AI-Group/BSR",cs.CV,"['cs.CV', 'eess.IV']" +Residual Learning in Diffusion Models,Junyu Zhang · Daochang Liu · Eunbyung Park · Shichao Zhang · Chang Xu, ,https://arxiv.org/abs/2308.13712,,2308.13712.pdf,Residual Denoising Diffusion Models,"We propose residual denoising diffusion models (RDDM), a novel dual diffusion +process that decouples the traditional single denoising diffusion process into +residual diffusion and noise diffusion. This dual diffusion framework expands +the denoising-based diffusion models, initially uninterpretable for image +restoration, into a unified and interpretable model for both image generation +and restoration by introducing residuals. Specifically, our residual diffusion +represents directional diffusion from the target image to the degraded input +image and explicitly guides the reverse generation process for image +restoration, while noise diffusion represents random perturbations in the +diffusion process. The residual prioritizes certainty, while the noise +emphasizes diversity, enabling RDDM to effectively unify tasks with varying +certainty or diversity requirements, such as image generation and restoration. +We demonstrate that our sampling process is consistent with that of DDPM and +DDIM through coefficient transformation, and propose a partially +path-independent generation process to better understand the reverse process. 
+Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a +batch size of 1, to compete with state-of-the-art image restoration methods. We +provide code and pre-trained models to encourage further exploration, +application, and development of our innovative framework +(https://github.com/nachifur/RDDM).",cs.CV,"['cs.CV', 'cs.LG']" +"What, How, and When Should Object Detectors Update in Continually Changing Test Domains?",Jayeon Yoo · Dongkwan Lee · Inseop Chung · Donghyun Kim · Nojun Kwak, ,https://arxiv.org/abs/2312.08875,,2312.08875.pdf,"What, How, and When Should Object Detectors Update in Continually Changing Test Domains?","It is a well-known fact that the performance of deep learning models +deteriorates when they encounter a distribution shift at test time. Test-time +adaptation (TTA) algorithms have been proposed to adapt the model online while +inferring test data. However, existing research predominantly focuses on +classification tasks through the optimization of batch normalization layers or +classification heads, but this approach limits its applicability to various +model architectures like Transformers and makes it challenging to apply to +other tasks, such as object detection. In this paper, we propose a novel online +adaption approach for object detection in continually changing test domains, +considering which part of the model to update, how to update it, and when to +perform the update. By introducing architecture-agnostic and lightweight +adaptor modules and only updating these while leaving the pre-trained backbone +unchanged, we can rapidly adapt to new test domains in an efficient way and +prevent catastrophic forgetting. Furthermore, we present a practical and +straightforward class-wise feature aligning method for object detection to +resolve domain shifts. Additionally, we enhance efficiency by determining when +the model is sufficiently adapted or when additional adaptation is needed due +to changes in the test distribution. Our approach surpasses baselines on widely +used benchmarks, achieving improvements of up to 4.9\%p and 7.9\%p in mAP for +COCO $\rightarrow$ COCO-corrupted and SHIFT, respectively, while maintaining +about 20 FPS or higher.",cs.CV,['cs.CV'] +Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement,Daiwei Yu · Zhuorong Li · Lina Wei · Canghong Jin · Yun Zhang · Sixian Chan, ,https://arxiv.org/abs/2403.09101,,2403.09101.pdf,Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement,"Adversarial training (AT) is currently one of the most effective ways to +obtain the robustness of deep neural networks against adversarial attacks. +However, most AT methods suffer from robust overfitting, i.e., a significant +generalization gap in adversarial robustness between the training and testing +curves. In this paper, we first identify a connection between robust +overfitting and the excessive memorization of noisy labels in AT from a view of +gradient norm. As such label noise is mainly caused by a distribution mismatch +and improper label assignments, we are motivated to propose a label refinement +approach for AT. Specifically, our Self-Guided Label Refinement first +self-refines a more accurate and informative label distribution from +over-confident hard labels, and then it calibrates the training by dynamically +incorporating knowledge from self-distilled models into the current model and +thus requiring no external teachers. 
Empirical results demonstrate that our +method can simultaneously boost the standard accuracy and robust performance +across multiple benchmark datasets, attack types, and architectures. In +addition, we also provide a set of analyses from the perspectives of +information theory to dive into our method and suggest the importance of soft +labels for robust generalization.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +SAOR: Single-View Articulated Object Reconstruction,Mehmet Aygun · Oisin Mac Aodha, ,,https://synthical.com/article/e8c0baeb-d277-4528-b526-8a08fcc46a22,,,,,nan +Infrared Adversarial Car Stickers,Xiaopei Zhu · Yuqiu Liu · Zhanhao Hu · Jianmin Li · Xiaolin Hu, ,https://arxiv.org/abs/2405.09924,,2405.09924.pdf,Infrared Adversarial Car Stickers,"Infrared physical adversarial examples are of great significance for studying +the security of infrared AI systems that are widely used in our lives such as +autonomous driving. Previous infrared physical attacks mainly focused on 2D +infrared pedestrian detection which may not fully manifest its destructiveness +to AI systems. In this work, we propose a physical attack method against +infrared detectors based on 3D modeling, which is applied to a real car. The +goal is to design a set of infrared adversarial stickers to make cars invisible +to infrared detectors at various viewing angles, distances, and scenes. We +build a 3D infrared car model with real infrared characteristics and propose an +infrared adversarial pattern generation method based on 3D mesh shadow. We +propose a 3D control points-based mesh smoothing algorithm and use a set of +smoothness loss functions to enhance the smoothness of adversarial meshes and +facilitate the sticker implementation. Besides, We designed the aluminum +stickers and conducted physical experiments on two real Mercedes-Benz A200L +cars. Our adversarial stickers hid the cars from Faster RCNN, an object +detector, at various viewing angles, distances, and scenes. The attack success +rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers +and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs +of the designed stickers against six unseen object detectors such as YOLOv3 and +Deformable DETR were between 73.35%-95.80%, showing good transferability of the +attack performance across detectors.",cs.CV,['cs.CV'] +Effective Video Mirror Detection with Inconsistent Motion Cues,Alex Warren · Ke Xu · Jiaying Lin · Gary Tam · Rynson W.H. Lau, ,,https://cronfa.swan.ac.uk/Record/cronfa65886/Details,,,,,nan +Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer,Hyeongjin Nam · Daniel Jung · Gyeongsik Moon · Kyoung Mu Lee,https://github.com/dqj5182/CONTHO_RELEASE,https://arxiv.org/abs/2404.04819,,2404.04819.pdf,Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer,"Human-object contact serves as a strong cue to understand how humans +physically interact with objects. Nevertheless, it is not widely explored to +utilize human-object contact information for the joint reconstruction of 3D +human and object from a single image. In this work, we present a novel joint 3D +human-object reconstruction method (CONTHO) that effectively exploits contact +information between humans and objects. There are two core designs in our +system: 1) 3D-guided contact estimation and 2) contact-based 3D human and +object refinement. 
First, for accurate human-object contact estimation, CONTHO +initially reconstructs 3D humans and objects and utilizes them as explicit 3D +guidance for contact estimation. Second, to refine the initial reconstructions +of 3D human and object, we propose a novel contact-based refinement Transformer +that effectively aggregates human features and object features based on the +estimated human-object contact. The proposed contact-based refinement prevents +the learning of erroneous correlation between human and object, which enables +accurate 3D reconstruction. As a result, our CONTHO achieves state-of-the-art +performance in both human-object contact estimation and joint reconstruction of +3D human and object. The code is publicly available at +https://github.com/dqj5182/CONTHO_RELEASE.",cs.CV,['cs.CV'] +SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model,Inhwan Bae · Young-Jae Park · Hae-Gon Jeon,https://github.com/InhwanBae/SingularTrajectory,https://arxiv.org/abs/2403.18452v1,,2403.18452v1.pdf,SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model,"There are five types of trajectory prediction tasks: deterministic, +stochastic, domain adaptation, momentary observation, and few-shot. These +associated tasks are defined by various factors, such as the length of input +paths, data split and pre-processing methods. Interestingly, even though they +commonly take sequential coordinates of observations as input and infer future +paths in the same coordinates as output, designing specialized architectures +for each task is still necessary. For the other task, generality issues can +lead to sub-optimal performances. In this paper, we propose SingularTrajectory, +a diffusion-based universal trajectory prediction framework to reduce the +performance gap across the five tasks. The core of SingularTrajectory is to +unify a variety of human dynamics representations on the associated tasks. To +do this, we first build a Singular space to project all types of motion +patterns from each task into one embedding space. We next propose an adaptive +anchor working in the Singular space. Unlike traditional fixed anchor methods +that sometimes yield unacceptable paths, our adaptive anchor enables correct +anchors, which are put into a wrong location, based on a traversability map. +Finally, we adopt a diffusion-based predictor to further enhance the prototype +paths using a cascaded denoising process. Our unified framework ensures the +generality across various benchmark settings such as input modality, and +trajectory lengths. Extensive experiments on five public benchmarks demonstrate +that SingularTrajectory substantially outperforms existing models, highlighting +its effectiveness in estimating general dynamics of human movements. Code is +publicly available at https://github.com/inhwanbae/SingularTrajectory .",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Gradient Reweighting: Towards Imbalanced Class-Incremental Learning,Jiangpeng He,https://github.com/JiangpengHe/imbalanced_cil,https://arxiv.org/abs/2402.18528,,2402.18528.pdf,Gradient Reweighting: Towards Imbalanced Class-Incremental Learning,"Class-Incremental Learning (CIL) trains a model to continually recognize new +classes from non-stationary data while retaining learned knowledge. 
A major +challenge of CIL arises when applying to real-world data characterized by +non-uniform distribution, which introduces a dual imbalance problem involving +(i) disparities between stored exemplars of old tasks and new class data +(inter-phase imbalance), and (ii) severe class imbalances within each +individual task (intra-phase imbalance). We show that this dual imbalance issue +causes skewed gradient updates with biased weights in FC layers, thus inducing +over/under-fitting and catastrophic forgetting in CIL. Our method addresses it +by reweighting the gradients towards balanced optimization and unbiased +classifier learning. Additionally, we observe imbalanced forgetting where +paradoxically the instance-rich classes suffer higher performance degradation +during CIL due to a larger amount of training data becoming unavailable in +subsequent learning phases. To tackle this, we further introduce a +distribution-aware knowledge distillation loss to mitigate forgetting by +aligning output logits proportionally with the distribution of lost training +data. We validate our method on CIFAR-100, ImageNetSubset, and Food101 across +various evaluation protocols and demonstrate consistent improvements compared +to existing works, showing great potential to apply CIL in real-world scenarios +with enhanced robustness and effectiveness.",cs.CV,['cs.CV'] +OpenEQA: Embodied Question Answering in the Era of Foundation Models,Arjun Majumdar · Anurag Ajay · Xiaohan Zhang · Sriram Yenamandra · Mikael Henaff · Alexander Sax · Sneha Silwal · Paul McVay · Oleksandr Maksymets · Sergio Arnaud · Pranav Putta · Karmesh Yadav · Qiyang Li · Benjamin Newman · Mohit Sharma · Mohit Sharma · Vincent-Pierre Berges · Shiqi Zhang · Pulkit Agrawal · Dhruv Batra · Yonatan Bisk · Mrinal Kalakrishnan · Franziska Meier · Chris Paxton · Aravind Rajeswaran, ,,https://openreview.net/forum?id=7JIW6e1UJX,,,,,nan +Batch Normalization Alleviates the Spectral Bias in Coordinate Networks,Zhicheng Cai · Hao Zhu · Qiu Shen · Xinran Wang · Xun Cao, ,https://arxiv.org/abs/2306.16999,,2306.16999.pdf,Spectral Batch Normalization: Normalization in the Frequency Domain,"Regularization is a set of techniques that are used to improve the +generalization ability of deep neural networks. In this paper, we introduce +spectral batch normalization (SBN), a novel effective method to improve +generalization by normalizing feature maps in the frequency (spectral) domain. +The activations of residual networks without batch normalization (BN) tend to +explode exponentially in the depth of the network at initialization. This leads +to extremely large feature map norms even though the parameters are relatively +small. These explosive dynamics can be very detrimental to learning. BN makes +weight decay regularization on the scaling factors $\gamma, \beta$ +approximately equivalent to an additive penalty on the norm of the feature +maps, which prevents extremely large feature map norms to a certain degree. +However, we show experimentally that, despite the approximate additive penalty +of BN, feature maps in deep neural networks (DNNs) tend to explode at the +beginning of the network and that feature maps of DNNs contain large values +during the whole training. This phenomenon also occurs in a weakened form in +non-residual networks. SBN addresses large feature maps by normalizing them in +the frequency domain. 
In our experiments, we empirically show that SBN prevents +exploding feature maps at initialization and large feature map values during +the training. Moreover, the normalization of feature maps in the frequency +domain leads to more uniform distributed frequency components. This discourages +the DNNs to rely on single frequency components of feature maps. These, +together with other effects of SBN, have a regularizing effect on the training +of residual and non-residual networks. We show experimentally that using SBN in +addition to standard regularization methods improves the performance of DNNs by +a relevant margin, e.g. ResNet50 on ImageNet by 0.71%.",cs.CV,"['cs.CV', 'cs.LG']" +Learning for Transductive Threshold Calibration in Open-World Recognition,Qin ZHANG · DONGSHENG An · Tianjun Xiao · Tong He · Qingming Tang · Ying Nian Wu · Joseph Tighe · Yifan Xing, ,,https://synthical.com/summary/ed7531f5-2d4e-43c1-95e3-15ec48a9b43d,,,,,nan +MatSynth: A Modern PBR Materials Dataset,Giuseppe Vecchio · Valentin Deschaintre,https://gvecchio.com/matsynth/,https://arxiv.org/abs/2401.06056,,2401.06056.pdf,MatSynth: A Modern PBR Materials Dataset,"We introduce MatSynth, a dataset of 4,000+ CC0 ultra-high resolution PBR +materials. Materials are crucial components of virtual relightable assets, +defining the interaction of light at the surface of geometries. Given their +importance, significant research effort was dedicated to their representation, +creation and acquisition. However, in the past 6 years, most research in +material acquisiton or generation relied either on the same unique dataset, or +on company-owned huge library of procedural materials. With this dataset we +propose a significantly larger, more diverse, and higher resolution set of +materials than previously publicly available. We carefully discuss the data +collection process and demonstrate the benefits of this dataset on material +acquisition and generation applications. The complete data further contains +metadata with each material's origin, license, category, tags, creation method +and, when available, descriptions and physical size, as well as 3M+ renderings +of the augmented materials, in 1K, under various environment lightings. The +MatSynth dataset is released through the project page at: +https://www.gvecchio.com/matsynth.",cs.CV,"['cs.CV', 'cs.GR']" +Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation,Xiaoyang Chen · Hao Zheng · Yuemeng LI · Yuncong Ma · Liang Ma · Hongming Li · Yong Fan, ,https://arxiv.org/abs/2311.10696,,2311.10696.pdf,Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation,"A versatile medical image segmentation model applicable to images acquired +with diverse equipment and protocols can facilitate model deployment and +maintenance. However, building such a model typically demands a large, diverse, +and fully annotated dataset, which is challenging to obtain due to the +labor-intensive nature of data curation. To address this challenge, we propose +a cost-effective alternative that harnesses multi-source data with only partial +or sparse segmentation labels for training, substantially reducing the cost of +developing a versatile model. We devise strategies for model +self-disambiguation, prior knowledge incorporation, and imbalance mitigation to +tackle challenges associated with inconsistently labeled multi-source data, +including label ambiguity and modality, dataset, and class imbalances. 
+Experimental results on a multi-modal dataset compiled from eight different +sources for abdominal structure segmentation have demonstrated the +effectiveness and superior performance of our method compared to +state-of-the-art alternative approaches. We anticipate that its cost-saving +features, which optimize the utilization of existing annotated data and reduce +annotation efforts for new data, will have a significant impact in the field.",cs.CV,['cs.CV'] +ASAM: Boosting Segment Anything Model with Adversarial Tuning,Bo Li · Haoke Xiao · Lv Tang, ,https://arxiv.org/abs/2405.00256,,2405.00256.pdf,ASAM: Boosting Segment Anything Model with Adversarial Tuning,"In the evolving landscape of computer vision, foundation models have emerged +as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. +Among these, the Segment Anything Model (SAM) by Meta AI has distinguished +itself in image segmentation. However, SAM, like its counterparts, encounters +limitations in specific niche applications, prompting a quest for enhancement +strategies that do not compromise its inherent capabilities. This paper +introduces ASAM, a novel methodology that amplifies SAM's performance through +adversarial tuning. We harness the potential of natural adversarial examples, +inspired by their successful implementation in natural language processing. By +utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B +dataset, generating adversarial instances that are more representative of +natural variations rather than conventional imperceptible perturbations. Our +approach maintains the photorealism of adversarial examples and ensures +alignment with original mask annotations, thereby preserving the integrity of +the segmentation task. The fine-tuned ASAM demonstrates significant +improvements across a diverse range of segmentation tasks without necessitating +additional data or architectural modifications. The results of our extensive +evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, +thereby contributing to the advancement of foundational models in computer +vision. Our project page is in https://asam2024.github.io/.",cs.CV,['cs.CV'] +FreeDrag: Feature Dragging for Reliable Point-based Image Editing,Pengyang Ling · Lin Chen · Pan Zhang · Huaian Chen · Yi Jin · Jinjin Zheng, ,https://arxiv.org/abs/2307.04684,,2307.04684.pdf,FreeDrag: Feature Dragging for Reliable Point-based Image Editing,"To serve the intricate and varied demands of image editing, precise and +flexible manipulation in image content is indispensable. Recently, Drag-based +editing methods have gained impressive performance. However, these methods +predominantly center on point dragging, resulting in two noteworthy drawbacks, +namely ""miss tracking"", where difficulties arise in accurately tracking the +predetermined handle points, and ""ambiguous tracking"", where tracked points are +potentially positioned in wrong regions that closely resemble the handle +points. To address the above issues, we propose FreeDrag, a feature dragging +methodology designed to free the burden on point tracking. The FreeDrag +incorporates two key designs, i.e., template feature via adaptive updating and +line search with backtracking, the former improves the stability against +drastic content change by elaborately controls feature updating scale after +each dragging, while the latter alleviates the misguidance from similar points +by actively restricting the search area in a line. 
These two technologies +together contribute to a more stable semantic dragging with higher efficiency. +Comprehensive experimental results substantiate that our approach significantly +outperforms pre-existing methodologies, offering reliable point-based editing +even in various complex scenarios.",cs.CV,"['cs.CV', 'cs.HC', 'cs.LG']" +ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization,Weiyao Wang · Pierre Gleize · Hao Tang · Xingyu Chen · Kevin Liang · Matt Feiszli, ,https://arxiv.org/abs/2401.08937,,2401.08937.pdf,ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization,"Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View +Synthesis (NVS) given a set of 2D images. However, NeRF training requires +accurate camera pose for each input view, typically obtained by +Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax +this constraint, but they still often rely on decent initial poses which they +can refine. Here we aim at removing the requirement for pose initialization. We +present Incremental CONfidence (ICON), an optimization procedure for training +NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate +initial guess for poses. Further, ICON introduces ``confidence"": an adaptive +measure of model quality used to dynamically reweight gradients. ICON relies on +high-confidence poses to learn NeRF, and high-confidence 3D structure (as +encoded by NeRF) to learn poses. We show that ICON, without prior pose +initialization, achieves superior performance in both CO3D and HO3D versus +methods which use SfM pose.",cs.CV,['cs.CV'] +StyLitGAN: Image-based Relighting via Latent Control,Anand Bhattad · James Soole · David Forsyth, ,https://ar5iv.labs.arxiv.org/html/2306.00987,,2306.00987.pdf,"StyleGAN knows Normal, Depth, Albedo, and More","Intrinsic images, in the original sense, are image-like maps of scene +properties like depth, normal, albedo or shading. This paper demonstrates that +StyleGAN can easily be induced to produce intrinsic images. The procedure is +straightforward. We show that, if StyleGAN produces $G({w})$ from latents +${w}$, then for each type of intrinsic image, there is a fixed offset ${d}_c$ +so that $G({w}+{d}_c)$ is that type of intrinsic image for $G({w})$. Here +${d}_c$ is {\em independent of ${w}$}. The StyleGAN we used was pretrained by +others, so this property is not some accident of our training regime. We show +that there are image transformations StyleGAN will {\em not} produce in this +fashion, so StyleGAN is not a generic image regression engine. + It is conceptually exciting that an image generator should ``know'' and +represent intrinsic images. There may also be practical advantages to using a +generative model to produce intrinsic images. The intrinsic images obtained +from StyleGAN compare well both qualitatively and quantitatively with those +obtained by using SOTA image regression techniques; but StyleGAN's intrinsic +images are robust to relighting effects, unlike SOTA methods.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Single Mesh Diffusion Models with Field Latents for Texture Generation,Thomas W. Mitchel · Carlos Esteves · Ameesh Makadia,https://single-mesh-diffusion.github.io/,https://arxiv.org/abs/2312.09250,,2312.09250.pdf,Single Mesh Diffusion Models with Field Latents for Texture Generation,"We introduce a framework for intrinsic latent diffusion models operating +directly on the surfaces of 3D shapes, with the goal of synthesizing +high-quality textures. Our approach is underpinned by two contributions: field +latents, a latent representation encoding textures as discrete vector fields on +the mesh vertices, and field latent diffusion models, which learn to denoise a +diffusion process in the learned latent space on the surface. We consider a +single-textured-mesh paradigm, where our models are trained to generate +variations of a given texture on a mesh. We show the synthesized textures are +of superior fidelity compared those from existing single-textured-mesh +generative models. Our models can also be adapted for user-controlled editing +tasks such as inpainting and label-guided generation. The efficacy of our +approach is due in part to the equivariance of our proposed framework under +isometries, allowing our models to seamlessly reproduce details across locally +similar regions and opening the door to a notion of generative texture +transfer.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Label-Efficient Group Robustness via Out-of-Distribution Concept Curation,Yiwei Yang · Anthony Liu · Robert Wolfe · Aylin Caliskan · Bill Howe, ,https://arxiv.org/abs/2403.06392,,2403.06392.pdf,Towards Robust Out-of-Distribution Generalization Bounds via Sharpness,"Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD +generalization, still lacks appropriate theoretical guarantees. Canonical OOD +bounds focus on different distance measurements between source and target +domains but fail to consider the optimization property of the learned model. As +empirically shown in recent work, the sharpness of learned minima influences +OOD generalization. To bridge this gap between optimization and OOD +generalization, we study the effect of sharpness on how a model tolerates data +change in domain shift which is usually captured by ""robustness"" in +generalization. In this paper, we give a rigorous connection between sharpness +and robustness, which gives better OOD guarantees for robust algorithms. It +also provides a theoretical backing for ""flat minima leads to better OOD +generalization"". Overall, we propose a sharpness-based OOD generalization bound +by taking robustness into consideration, resulting in a tighter bound than +non-robust guarantees. Our findings are supported by the experiments on a ridge +regression model, as well as the experiments on deep learning classification +tasks.",cs.LG,['cs.LG'] +EventPS: Real-Time Photometric Stereo Using an Event Camera,Bohan Yu · Jieji Ren · Jin Han · Feishi Wang · Jinxiu Liang · Boxin Shi, ,https://arxiv.org/abs/2312.11911,,2312.11911.pdf,"EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State Estimation and 3D Dense Mapping","Event cameras are bio-inspired, motion-activated sensors that demonstrate +substantial potential in handling challenging situations, such as motion blur +and high-dynamic range. In this paper, we proposed EVI-SAM to tackle the +problem of 6 DoF pose tracking and 3D reconstruction using monocular event +camera. A novel event-based hybrid tracking framework is designed to estimate +the pose, leveraging the robustness of feature matching and the precision of +direct alignment.
Specifically, we develop an event-based 2D-2D alignment to +construct the photometric constraint, and tightly integrate it with the +event-based reprojection constraint. The mapping module recovers the dense and +colorful depth of the scene through the image-guided event-based mapping +method. Subsequently, the appearance, texture, and surface mesh of the 3D scene +can be reconstructed by fusing the dense depth map from multiple viewpoints +using truncated signed distance function (TSDF) fusion. To the best of our +knowledge, this is the first non-learning work to realize event-based dense +mapping. Numerical evaluations are performed on both publicly available and +self-collected datasets, which qualitatively and quantitatively demonstrate the +superior performance of our method. Our EVI-SAM effectively balances accuracy +and robustness while maintaining computational efficiency, showcasing superior +pose tracking and dense mapping performance in challenging scenarios. Video +Demo: https://youtu.be/Nn40U4e5Si8.",cs.CV,"['cs.CV', 'cs.RO']" +Towards Understanding and Improving Adversarial Robustness of Vision Transformers,Samyak Jain · Tanima Dutta, ,https://arxiv.org/html/2208.09602v2,,,Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective,"The Vision Transformer has emerged as a powerful tool for image +classification tasks, surpassing the performance of convolutional neural +networks (CNNs). Recently, many researchers have attempted to understand the +robustness of Transformers against adversarial attacks. However, previous +researches have focused solely on perturbations in the spatial domain. This +paper proposes an additional perspective that explores the adversarial +robustness of Transformers against frequency-selective perturbations in the +spectral domain. To facilitate comparison between these two domains, an attack +framework is formulated as a flexible tool for implementing attacks on images +in the spatial and spectral domains. The experiments reveal that Transformers +rely more on phase and low frequency information, which can render them more +vulnerable to frequency-selective attacks than CNNs. This work offers new +insights into the properties and adversarial robustness of Transformers.",cs.CV,['cs.CV'] +On Train-Test Class Overlap and Detection for Image Retrieval,Chull Hwan Song · Jooyoung Yoon · Taebaek Hwang · Shunghyun Choi · Yeong Hyeon Gu · Yannis Avrithis, ,https://arxiv.org/abs/2404.01524,,2404.01524.pdf,On Train-Test Class Overlap and Detection for Image Retrieval,"How important is it for training and evaluation sets to not have class +overlap in image retrieval? We revisit Google Landmarks v2 clean, the most +popular training set, by identifying and removing class overlap with Revisited +Oxford and Paris [34], the most popular evaluation set. By comparing the +original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art +methods, our findings are striking. Not only is there a dramatic drop in +performance, but it is inconsistent across methods, changing the ranking.What +does it take to focus on objects or interest and ignore background clutter when +indexing? Do we need to train an object detector and the representation +separately? Do we need location supervision? We introduce Single-stage +Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect +objects of interest and extract a global image representation. 
We outperform +previous state-of-the-art on both existing training sets and the new +RGLDv2-clean. Our dataset is available at +https://github.com/dealicious-inc/RGLDv2-clean.",cs.CV,"['cs.CV', 'cs.AI']" +Semantic Line Combination Detector,JINWON KO · Dongkwon Jin · Chang-Su Kim, ,https://arxiv.org/abs/2404.18399,,2404.18399.pdf,Semantic Line Combination Detector,"A novel algorithm, called semantic line combination detector (SLCD), to find +an optimal combination of semantic lines is proposed in this paper. It +processes all lines in each line combination at once to assess the overall +harmony of the lines. First, we generate various line combinations from +reliable lines. Second, we estimate the score of each line combination and +determine the best one. Experimental results demonstrate that the proposed SLCD +outperforms existing semantic line detectors on various datasets. Moreover, it +is shown that SLCD can be applied effectively to three vision tasks of +vanishing point detection, symmetry axis detection, and composition-based image +retrieval. Our codes are available at https://github.com/Jinwon-Ko/SLCD.",cs.CV,['cs.CV'] +Robust Noisy Correspondence Learning with Equivariant Similarity Consistency,Yuchen Yang · Erkun Yang · Likai Wang · Cheng Deng, ,,https://dl.acm.org/doi/10.1145/3662732,,,,,nan +Event-based Structure-from-Orbit,Ethan Elms · Yasir Latif · Tae Ha Park · Tat-Jun Chin, ,https://arxiv.org/abs/2405.06216,,2405.06216.pdf,Event-based Structure-from-Orbit,"Event sensors offer high temporal resolution visual sensing, which makes them +ideal for perceiving fast visual phenomena without suffering from motion blur. +Certain applications in robotics and vision-based navigation require 3D +perception of an object undergoing circular or spinning motion in front of a +static camera, such as recovering the angular velocity and shape of the object. +The setting is equivalent to observing a static object with an orbiting camera. +In this paper, we propose event-based structure-from-orbit (eSfO), where the +aim is to simultaneously reconstruct the 3D structure of a fast spinning object +observed from a static event camera, and recover the equivalent orbital motion +of the camera. Our contributions are threefold: since state-of-the-art event +feature trackers cannot handle periodic self-occlusion due to the spinning +motion, we develop a novel event feature tracker based on spatio-temporal +clustering and data association that can better track the helical trajectories +of valid features in the event data. The feature tracks are then fed to our +novel factor graph-based structure-from-orbit back-end that calculates the +orbital motion parameters (e.g., spin rate, relative rotational axis) that +minimize the reprojection error. For evaluation, we produce a new event dataset +of objects under spinning motion. Comparisons against ground truth indicate the +efficacy of eSfO.",cs.CV,['cs.CV'] +HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions,Hao Xu · Li Haipeng · Yinqiao Wang · Shuaicheng Liu · Chi-Wing Fu, ,https://arxiv.org/abs/2403.18575,,2403.18575.pdf,HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions,"Reconstructing 3D hand mesh robustly from a single image is very challenging, +due to the lack of diversity in existing real-world datasets. While data +synthesis helps relieve the issue, the syn-to-real gap still hinders its usage. 
+In this work, we present HandBooster, a new approach to uplift the data +diversity and boost the 3D hand-mesh reconstruction performance by training a +conditional generative space on hand-object interactions and purposely sampling +the space to synthesize effective data samples. First, we construct versatile +content-aware conditions to guide a diffusion model to produce realistic images +with diverse hand appearances, poses, views, and backgrounds; favorably, +accurate 3D annotations are obtained for free. Then, we design a novel +condition creator based on our similarity-aware distribution sampling +strategies to deliberately find novel and realistic interaction poses that are +distinctive from the training set. Equipped with our method, several baselines +can be significantly improved beyond the SOTA on the HO3D and DexYCB +benchmarks. Our code will be released on +https://github.com/hxwork/HandBooster_Pytorch.",cs.CV,['cs.CV'] +Customization Assistant for Text-to-image Generation,Yufan Zhou · Ruiyi Zhang · Jiuxiang Gu · Tong Sun, ,https://arxiv.org/abs/2312.03045,,2312.03045.pdf,Customization Assistant for Text-to-image Generation,"Customizing pre-trained text-to-image generation model has attracted massive +research interest recently, due to its huge potential in real-world +applications. Although existing methods are able to generate creative content +for a novel concept contained in single user-input image, their capability are +still far from perfection. Specifically, most existing methods require +fine-tuning the generative model on testing images. Some existing methods do +not require fine-tuning, while their performance are unsatisfactory. +Furthermore, the interaction between users and models are still limited to +directive and descriptive prompts such as instructions and captions. In this +work, we build a customization assistant based on pre-trained large language +model and diffusion model, which can not only perform customized generation in +a tuning-free manner, but also enable more user-friendly interactions: users +can chat with the assistant and input either ambiguous text or clear +instruction. Specifically, we propose a new framework consists of a new model +design and a novel training strategy. The resulting assistant can perform +customized generation in 2-5 seconds without any test time fine-tuning. +Extensive experiments are conducted, competitive results have been obtained +across different domains, illustrating the effectiveness of the proposed +method.",cs.CV,['cs.CV'] +Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering,Jiawei Yao · Qi Qian · Juhua Hu,https://github.com/Alexander-Yao/Multi-MaP,https://arxiv.org/abs/2404.15655,,2404.15655.pdf,Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering,"Multiple clustering has gained significant attention in recent years due to +its potential to reveal multiple hidden structures of data from different +perspectives. The advent of deep multiple clustering techniques has notably +advanced the performance by uncovering complex patterns and relationships +within large datasets. However, a major challenge arises as users often do not +need all the clusterings that algorithms generate, and figuring out the one +needed requires a substantial understanding of each clustering result. 
+Traditionally, aligning a user's brief keyword of interest with the +corresponding vision components was challenging, but the emergence of +multi-modal and large language models (LLMs) has begun to bridge this gap. In +response, given unlabeled target visual data, we propose Multi-MaP, a novel +method employing a multi-modal proxy learning process. It leverages CLIP +encoders to extract coherent text and image embeddings, with GPT-4 integrating +users' interests to formulate effective textual contexts. Moreover, reference +word constraint and concept-level constraint are designed to learn the optimal +text proxy according to the user's interest. Multi-MaP not only adeptly +captures a user's interest via a keyword but also facilitates identifying +relevant clusterings. Our extensive experiments show that Multi-MaP +consistently outperforms state-of-the-art methods in all benchmark +multi-clustering vision tasks. Our code is available at +https://github.com/Alexander-Yao/Multi-MaP.",cs.CV,['cs.CV'] +Anchor-based Robust Finetuning of Vision-Language Models,Jinwei Han · Zhiwen Lin · Zhongyisun Sun · Yingguo Gao · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia,https://github.com/LixDemon/ARF,https://arxiv.org/abs/2404.06244,,2404.06244.pdf,Anchor-based Robust Finetuning of Vision-Language Models,"We aim at finetuning a vision-language model without hurting its +out-of-distribution (OOD) generalization. We address two types of OOD +generalization, i.e., i) domain shift such as natural to sketch images, and ii) +zero-shot capability to recognize the category that was not contained in the +finetune data. Arguably, the diminished OOD generalization after finetuning +stems from the excessively simplified finetuning target, which only provides +the class information, such as ``a photo of a [CLASS]''. This is distinct from +the process in that CLIP was pretrained, where there is abundant text +supervision with rich semantic information. Therefore, we propose to compensate +for the finetune process using auxiliary supervision with rich semantic +information, which acts as anchors to preserve the OOD generalization. +Specifically, two types of anchors are elaborated in our method, including i) +text-compensated anchor which uses the images from the finetune set but +enriches the text supervision from a pretrained captioner, ii) image-text-pair +anchor which is retrieved from the dataset similar to pretraining data of CLIP +according to the downstream task, associating with the original CLIP text with +rich semantics. Those anchors are utilized as auxiliary semantic information to +maintain the original feature space of CLIP, thereby preserving the OOD +generalization capabilities. Comprehensive experiments demonstrate that our +method achieves in-distribution performance akin to conventional finetuning +while attaining new state-of-the-art results on domain shift and zero-shot +learning benchmarks.",cs.CV,['cs.CV'] +LEAD: Exploring Logit Space Evolution for Model Selection,Zixuan Hu · Xiaotong Li · SHIXIANG TANG · Jun Liu · Yichun Hu · Ling-Yu Duan, ,https://arxiv.org/abs/2308.15074,,2308.15074.pdf,Exploring Model Transferability through the Lens of Potential Energy,"Transfer learning has become crucial in computer vision tasks due to the vast +availability of pre-trained deep learning models. However, selecting the +optimal pre-trained model from a diverse pool for a specific downstream task +remains a challenge. 
Existing methods for measuring the transferability of +pre-trained models rely on statistical correlations between encoded static +features and task labels, but they overlook the impact of underlying +representation dynamics during fine-tuning, leading to unreliable results, +especially for self-supervised models. In this paper, we present an insightful +physics-inspired approach named PED to address these challenges. We reframe the +challenge of model selection through the lens of potential energy and directly +model the interaction forces that influence fine-tuning dynamics. By capturing +the motion of dynamic representations to decline the potential energy within a +force-driven physical model, we can acquire an enhanced and more stable +observation for estimating transferability. The experimental results on 10 +downstream tasks and 12 self-supervised models demonstrate that our approach +can seamlessly integrate into existing ranking techniques and enhance their +performances, revealing its effectiveness for the model selection task and its +potential for understanding the mechanism in transfer learning. Code will be +available at https://github.com/lixiaotong97/PED.",cs.CV,"['cs.CV', 'cs.LG']" +Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions,Weizhen He · Yiheng Deng · SHIXIANG TANG · Qihao CHEN · Qingsong Xie · Yizhou Wang · Lei Bai · Feng Zhu · Rui Zhao · Wanli Ouyang · Donglian Qi · Yunfeng Yan, ,https://arxiv.org/abs/2306.07520,,2306.07520.pdf,Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions,"Human intelligence can retrieve any person according to both visual and +language descriptions. However, the current computer vision community studies +specific person re-identification (ReID) tasks in different scenarios +separately, which limits the applications in the real world. This paper strives +to resolve this problem by proposing a new instruct-ReID task that requires the +model to retrieve images according to the given image or language instructions. +Our instruct-ReID is a more general ReID setting, where existing 6 ReID tasks +can be viewed as special cases by designing different instructions. We propose +a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline +method to facilitate research in this new setting. Experimental results show +that the proposed multi-purpose ReID model, trained on our OmniReID benchmark +without fine-tuning, can improve +0.5%, +0.6%, +7.7% mAP on Market1501, MSMT17, +CUHK03 for traditional ReID, +6.4%, +7.1%, +11.2% mAP on PRCC, VC-Clothes, LTCC +for clothes-changing ReID, +11.7% mAP on COCAS+ real2 for clothes template +based clothes-changing ReID when using only RGB images, +24.9% mAP on COCAS+ +real2 for our newly defined language-instructed ReID, +4.3% on LLCM for +visible-infrared ReID, +2.6% on CUHK-PEDES for text-to-image ReID. The +datasets, the model, and code will be available at +https://github.com/hwz-zju/Instruct-ReID.",cs.CV,['cs.CV'] +Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation,Siteng Huang · Biao Gong · Yutong Feng · Xi Chen · Yuqian Fu · Yu Liu · Donglin Wang, ,https://arxiv.org/abs/2311.15841,,2311.15841.pdf,Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation,"This study focuses on a novel task in text-to-image (T2I) generation, namely +action customization. The objective of this task is to learn the co-existing +action from limited data and generalize it to unseen humans or even animals. 
+Experimental results show that existing subject-driven customization methods +fail to learn the representative characteristics of actions and struggle in +decoupling actions from context features, including appearance. To overcome the +preference for low-level features and the entanglement of high-level features, +we propose an inversion-based method Action-Disentangled Identifier (ADI) to +learn action-specific identifiers from the exemplar images. ADI first expands +the semantic conditioning space by introducing layer-wise identifier tokens, +thereby increasing the representational richness while distributing the +inversion across different features. Then, to block the inversion of +action-agnostic features, ADI extracts the gradient invariance from the +constructed sample triples and masks the updates of irrelevant channels. To +comprehensively evaluate the task, we present an ActionBench that includes a +variety of actions, each accompanied by meticulously selected samples. Both +quantitative and qualitative results show that our ADI outperforms existing +baselines in action-customized T2I generation. Our project page is at +https://adi-t2i.github.io/ADI.",cs.CV,['cs.CV'] +ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang, ,https://arxiv.org/abs/2311.16494,,2311.16494.pdf,ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,"Although soft prompt tuning is effective in efficiently adapting +Vision-Language (V&L) models for downstream tasks, it shows limitations in +dealing with distribution shifts. We address this issue with Attribute-Guided +Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the +conventional approach of directly appending soft prompts preceding class names, +we align the model with primitive visual attributes generated by Large Language +Models (LLMs). We posit that a model's ability to express high confidence in +these attributes signifies its capacity to discern the correct class +rationales. 2) We introduce attribute sampling to eliminate disadvantageous +attributes, thus only semantically meaningful attributes are preserved. 3) We +propose negative prompting, explicitly enumerating class-agnostic attributes to +activate spurious correlations and encourage the model to generate highly +orthogonal probability distributions in relation to these negative features. In +experiments, our method significantly outperforms current state-of-the-art +prompt tuning methods on both novel class prediction and out-of-distribution +generalization tasks.",cs.CV,['cs.CV'] +Narrative Action Evaluation with Prompt-Guided Multimodal Interaction,Shiyi Zhang · Sule Bai · Guangyi Chen · Lei Chen · Jiwen Lu · Junle Wang · Yansong Tang,https://github.com/shiyi-zh0408/NAE_CVPR2024,https://arxiv.org/abs/2404.14471,,2404.14471.pdf,Narrative Action Evaluation with Prompt-Guided Multimodal Interaction,"In this paper, we investigate a new problem called narrative action +evaluation (NAE). NAE aims to generate professional commentary that evaluates +the execution of an action. Unlike traditional tasks such as score-based action +quality assessment and video captioning involving superficial sentences, NAE +focuses on creating detailed narratives in natural language. These narratives +provide intricate descriptions of actions along with objective evaluations. NAE +is a more challenging task because it requires both narrative flexibility and +evaluation rigor. 
One existing possible solution is to use multi-task learning, +where narrative language and evaluative information are predicted separately. +However, this approach results in reduced performance for individual tasks +because of variations between tasks and differences in modality between +language information and evaluation information. To address this, we propose a +prompt-guided multimodal interaction framework. This framework utilizes a pair +of transformers to facilitate the interaction between different modalities of +information. It also uses prompts to transform the score regression task into a +video-text matching task, thus enabling task interactivity. To support further +research in this field, we re-annotate the MTL-AQA and FineGym datasets with +high-quality and comprehensive action narration. Additionally, we establish +benchmarks for NAE. Extensive experiment results prove that our method +outperforms separate learning methods and naive multi-task learning methods. +Data and code are released at https://github.com/shiyi-zh0408/NAE_CVPR2024.",cs.CV,['cs.CV'] +Improved Implicit Neural Representation with Fourier Reparameterized Training,Kexuan Shi · Xingyu Zhou · Shuhang Gu, ,https://arxiv.org/abs/2401.07402,,2401.07402.pdf,Improved Implicit Neural Representation with Fourier Bases Reparameterized Training,"Implicit Neural Representation (INR) as a mighty representation paradigm has +achieved success in various computer vision tasks recently. Due to the +low-frequency bias issue of vanilla multi-layer perceptron (MLP), existing +methods have investigated advanced techniques, such as positional encoding and +periodic activation function, to improve the accuracy of INR. In this paper, we +connect the network training bias with the reparameterization technique and +theoretically prove that weight reparameterization could provide us a chance to +alleviate the spectral bias of MLP. Based on our theoretical analysis, we +propose a Fourier reparameterization method which learns coefficient matrix of +fixed Fourier bases to compose the weights of MLP. We evaluate the proposed +Fourier reparameterization method on different INR tasks with various MLP +architectures, including vanilla MLP, MLP with positional encoding and MLP with +advanced activation function, etc. The superiority approximation results on +different MLP architectures clearly validate the advantage of our proposed +method. Armed with our Fourier reparameterization method, better INR with more +textures and less artifacts can be learned from the training data.",cs.CV,['cs.CV'] +"Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications",Karren Yang · Anurag Ranjan · Jen-Hao Rick Chang · Raviteja Vemulapalli · Oncel Tuzel, ,https://arxiv.org/abs/2311.18168,,2311.18168.pdf,"Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications","We consider the task of animating 3D facial geometry from speech signal. +Existing works are primarily deterministic, focusing on learning a one-to-one +mapping from speech signal to 3D face meshes on small datasets with limited +speakers. While these models can achieve high-quality lip articulation for +speakers in the training set, they are unable to capture the full and diverse +distribution of 3D facial motions that accompany speech in the real world. 
+Importantly, the relationship between speech and facial motion is one-to-many, +containing both inter-speaker and intra-speaker variations and necessitating a +probabilistic approach. In this paper, we identify and address key challenges +that have so far limited the development of probabilistic models: lack of +datasets and metrics that are suitable for training and evaluating them, as +well as the difficulty of designing a model that generates diverse results +while remaining faithful to a strong conditioning signal as speech. We first +propose large-scale benchmark datasets and metrics suitable for probabilistic +modeling. Then, we demonstrate a probabilistic model that achieves both +diversity and fidelity to speech, outperforming other methods across the +proposed benchmarks. Finally, we showcase useful applications of probabilistic +models trained on these large-scale datasets: we can generate diverse +speech-driven 3D facial motion that matches unseen speaker styles extracted +from reference clips; and our synthetic meshes can be used to improve the +performance of downstream audio-visual models.",cs.CV,"['cs.CV', 'cs.LG', 'eess.AS']" +EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting,Zitao Wang · Qiguang Miao · Yue Xi · Peipei Zhao, ,https://arxiv.org/abs/2308.12831,,2308.12831.pdf,EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting,"The portrait matting task aims to extract an alpha matte with complete +semantics and finely-detailed contours. In comparison to CNN-based approaches, +transformers with self-attention module have a better capacity to capture +long-range dependencies and low-frequency semantic information of a portrait. +However, the recent research shows that self-attention mechanism struggles with +modeling high-frequency contour information and capturing fine contour details, +which can lead to bias while predicting the portrait's contours. To deal with +this issue, we propose EFormer to enhance the model's attention towards both of +the low-frequency semantic and high-frequency contour features. For the +high-frequency contours, our research demonstrates that cross-attention module +between different resolutions can guide our model to allocate attention +appropriately to these contour regions. Supported on this, we can successfully +extract the high-frequency detail information around the portrait's contours, +which are previously ignored by self-attention. Based on cross-attention +module, we further build a semantic and contour detector (SCD) to accurately +capture both of the low-frequency semantic and high-frequency contour features. +And we design contour-edge extraction branch and semantic extraction branch to +extract refined high-frequency contour features and complete low-frequency +semantic information, respectively. Finally, we fuse the two kinds of features +and leverage segmentation head to generate a predicted portrait matte. 
+Experiments on VideoMatte240K (JPEG SD Format) and Adobe Image Matting (AIM) +datasets demonstrate that EFormer outperforms previous portrait matte methods.",cs.CV,['cs.CV'] +Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation,Bingfeng Zhang · Siyue Yu · Yunchao Wei · Yao Zhao · Jimin Xiao, ,https://arxiv.org/html/2405.14294v1,,2405.14294v1.pdf,Tuning-free Universally-Supervised Semantic Segmentation,"This work presents a tuning-free semantic segmentation framework based on +classifying SAM masks by CLIP, which is universally applicable to various types +of supervision. Initially, we utilize CLIP's zero-shot classification ability +to generate pseudo-labels or perform open-vocabulary segmentation. However, the +misalignment between mask and CLIP text embeddings leads to suboptimal results. +To address this issue, we propose discrimination-bias aligned CLIP to closely +align mask and text embedding, offering an overhead-free performance gain. We +then construct a global-local consistent classifier to classify SAM masks, +which reveals the intrinsic structure of high-quality embeddings produced by +DBA-CLIP and demonstrates robustness against noisy pseudo-labels. Extensive +experiments validate the efficiency and effectiveness of our method, and we +achieve state-of-the-art (SOTA) or competitive performance across various +datasets and supervision types.",cs.CV,['cs.CV'] +A Simple Baseline for Efficient Hand Mesh Reconstruction,zhishan zhou · shihao zhou · Zhi Lv · minqiang zou · Yao Tang · Jiajun Liang,https://simplehand.github.io/,https://arxiv.org/abs/2403.01813,,2403.01813.pdf,A Simple Baseline for Efficient Hand Mesh Reconstruction,"3D hand pose estimation has found broad application in areas such as gesture +recognition and human-machine interaction tasks. As performance improves, the +complexity of the systems also increases, which can limit the comparative +analysis and practical implementation of these methods. In this paper, we +propose a simple yet effective baseline that not only surpasses +state-of-the-art (SOTA) methods but also demonstrates computational efficiency. +To establish this baseline, we abstract existing work into two components: a +token generator and a mesh regressor, and then examine their core structures. A +core structure, in this context, is one that fulfills intrinsic functions, +brings about significant improvements, and achieves excellent performance +without unnecessary complexities. Our proposed approach is decoupled from any +modifications to the backbone, making it adaptable to any modern models. Our +method outperforms existing solutions, achieving state-of-the-art (SOTA) +results across multiple datasets. On the FreiHAND dataset, our approach +produced a PA-MPJPE of 5.7mm and a PA-MPVPE of 6.0mm. Similarly, on the Dexycb +dataset, we observed a PA-MPJPE of 5.5mm and a PA-MPVPE of 5.0mm. As for +performance speed, our method reached up to 33 frames per second (fps) when +using HRNet and up to 70 fps when employing FastViT-MA36",cs.CV,['cs.CV'] +Score-Guided Diffusion for 3D Human Recovery,Anastasis Stathopoulos · Ligong Han · Dimitris N. Metaxas,https://statho.github.io/ScoreHMR/,http://export.arxiv.org/abs/2403.09623,,2403.09623.pdf,Score-Guided Diffusion for 3D Human Recovery,"We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for +solving inverse problems for 3D human pose and shape reconstruction. 
These +inverse problems involve fitting a human body model to image observations, +traditionally solved through optimization techniques. ScoreHMR mimics model +fitting approaches, but alignment with the image observation is achieved +through score guidance in the latent space of a diffusion model. The diffusion +model is trained to capture the conditional distribution of the human model +parameters given an input image. By guiding its denoising process with a +task-specific score, ScoreHMR effectively solves inverse problems for various +applications without the need for retraining the task-agnostic diffusion model. +We evaluate our approach on three settings/applications. These are: (i) +single-frame model fitting; (ii) reconstruction from multiple uncalibrated +views; (iii) reconstructing humans in video sequences. ScoreHMR consistently +outperforms all optimization baselines on popular benchmarks across all +settings. We make our code and models available at the +https://statho.github.io/ScoreHMR.",cs.CV,['cs.CV'] +Diversified and Personalized Multi-rater Medical Image Segmentation,Yicheng Wu · Xiangde Luo · Zhe Xu · Xiaoqing Guo · Lie Ju · Zongyuan Ge · Wenjun Liao · Jianfei Cai,https://github.com/ycwu1997/D-Persona,https://arxiv.org/abs/2403.13417,,2403.13417.pdf,Diversified and Personalized Multi-rater Medical Image Segmentation,"Annotation ambiguity due to inherent data uncertainties such as blurred +boundaries in medical scans and different observer expertise and preferences +has become a major obstacle for training deep-learning based medical image +segmentation models. To address it, the common practice is to gather multiple +annotations from different experts, leading to the setting of multi-rater +medical image segmentation. Existing works aim to either merge different +annotations into the ""groundtruth"" that is often unattainable in numerous +medical contexts, or generate diverse results, or produce personalized results +corresponding to individual expert raters. Here, we bring up a more ambitious +goal for multi-rater medical image segmentation, i.e., obtaining both +diversified and personalized results. Specifically, we propose a two-stage +framework named D-Persona (first Diversification and then Personalization). In +Stage I, we exploit multiple given annotations to train a Probabilistic U-Net +model, with a bound-constrained loss to improve the prediction diversity. In +this way, a common latent space is constructed in Stage I, where different +latent codes denote diversified expert opinions. Then, in Stage II, we design +multiple attention-based projection heads to adaptively query the corresponding +expert prompts from the shared latent space, and then perform the personalized +medical image segmentation. We evaluated the proposed model on our in-house +Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e., +LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide +diversified and personalized results at the same time, achieving new SOTA +performance for multi-rater medical image segmentation. 
Our code will be +released at https://github.com/ycwu1997/D-Persona.",cs.CV,['cs.CV'] +AnyDoor: Zero-shot Object-level Image Customization,Xi Chen · Lianghua Huang · Yu Liu · Yujun Shen · Deli Zhao · Hengshuang Zhao, ,https://arxiv.org/abs/2307.09481,,2307.09481.pdf,AnyDoor: Zero-shot Object-level Image Customization,"This work presents AnyDoor, a diffusion-based image generator with the power +to teleport target objects to new scenes at user-specified locations in a +harmonious way. Instead of tuning parameters for each object, our model is +trained only once and effortlessly generalizes to diverse object-scene +combinations at the inference stage. Such a challenging zero-shot setting +requires an adequate characterization of a certain object. To this end, we +complement the commonly used identity feature with detail features, which are +carefully designed to maintain texture details yet allow versatile local +variations (e.g., lighting, orientation, posture, etc.), supporting the object +in favorably blending with different surroundings. We further propose to borrow +knowledge from video datasets, where we can observe various forms (i.e., along +the time axis) of a single object, leading to stronger model generalizability +and robustness. Extensive experiments demonstrate the superiority of our +approach over existing alternatives as well as its great potential in +real-world applications, such as virtual try-on and object moving. Project page +is https://damo-vilab.github.io/AnyDoor-Page/.",cs.CV,['cs.CV'] +Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?,Hanxin Zhu · Tianyu He · Xin Li · Bingchen Li · Zhibo Chen, ,https://arxiv.org/abs/2403.06092,,2403.06092.pdf,Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?,"Neural Radiance Field (NeRF) has achieved superior performance for novel view +synthesis by modeling the scene with a Multi-Layer Perception (MLP) and a +volume rendering procedure, however, when fewer known views are given (i.e., +few-shot view synthesis), the model is prone to overfit the given views. To +handle this issue, previous efforts have been made towards leveraging learned +priors or introducing additional regularizations. In contrast, in this paper, +we for the first time provide an orthogonal method from the perspective of +network structure. Given the observation that trivially reducing the number of +model parameters alleviates the overfitting issue, but at the cost of missing +details, we propose the multi-input MLP (mi-MLP) that incorporates the inputs +(i.e., location and viewing direction) of the vanilla MLP into each layer to +prevent the overfitting issue without harming detailed synthesis. To further +reduce the artifacts, we propose to model colors and volume density separately +and present two regularization terms. Extensive experiments on multiple +datasets demonstrate that: 1) although the proposed mi-MLP is easy to +implement, it is surprisingly effective as it boosts the PSNR of the baseline +from $14.73$ to $24.23$. 2) the overall framework achieves state-of-the-art +results on a wide range of benchmarks. 
We will release the code upon +publication.",cs.CV,['cs.CV'] +PIGEON: Predicting Image Geolocations,Lukas Haas · Michal Skreta · Silas Alberti · Chelsea Finn,https://lukashaas.github.io/PIGEON-CVPR24/,,https://huggingface.co/papers/2307.05845,,,,,nan +Nearest Is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks,Boheng Li · Yishuo Cai · Haowei Li · Feng Xue · Zhifeng Li · Yiming Li, ,https://arxiv.org/abs/2405.12725,,2405.12725.pdf,Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks,"Model quantization is widely used to compress and accelerate deep neural +networks. However, recent studies have revealed the feasibility of weaponizing +model quantization via implanting quantization-conditioned backdoors (QCBs). +These special backdoors stay dormant on released full-precision models but will +come into effect after standard quantization. Due to the peculiarity of QCBs, +existing defenses have minor effects on reducing their threats or are even +infeasible. In this paper, we conduct the first in-depth analysis of QCBs. We +reveal that the activation of existing QCBs primarily stems from the nearest +rounding operation and is closely related to the norms of neuron-wise +truncation errors (i.e., the difference between the continuous full-precision +weights and its quantized version). Motivated by these insights, we propose +Error-guided Flipped Rounding with Activation Preservation (EFRAP), an +effective and practical defense against QCBs. Specifically, EFRAP learns a +non-nearest rounding strategy with neuron-wise error norm and layer-wise +activation preservation guidance, flipping the rounding strategies of neurons +crucial for backdoor effects but with minimal impact on clean accuracy. +Extensive evaluations on benchmark datasets demonstrate that our EFRAP can +defeat state-of-the-art QCB attacks under various settings. Code is available +at https://github.com/AntigoneRandy/QuantBackdoor_EFRAP.",cs.CR,"['cs.CR', 'cs.CV']" +Interactive3D: Create What You Want by Interactive 3D Generation,Shaocong Dong · Lihe Ding · Zhanpeng Huang · Zibin Wang · Tianfan Xue · Dan Xu, ,https://arxiv.org/abs/2404.16510,,2404.16510.pdf,Interactive3D: Create What You Want by Interactive 3D Generation,"3D object generation has undergone significant advancements, yielding +high-quality results. However, fall short of achieving precise user control, +often yielding results that do not align with user expectations, thus limiting +their applicability. User-envisioning 3D object generation faces significant +challenges in realizing its concepts using current generative models due to +limited interaction capabilities. Existing methods mainly offer two approaches: +(i) interpreting textual instructions with constrained controllability, or (ii) +reconstructing 3D objects from 2D images. Both of them limit customization to +the confines of the 2D reference and potentially introduce undesirable +artifacts during the 3D lifting process, restricting the scope for direct and +versatile 3D modifications. In this work, we introduce Interactive3D, an +innovative framework for interactive 3D generation that grants users precise +control over the generative process through extensive 3D interaction +capabilities. Interactive3D is constructed in two cascading stages, utilizing +distinct 3D representations. 
The first stage employs Gaussian Splatting for +direct user interaction, allowing modifications and guidance of the generative +direction at any intermediate step through (i) Adding and Removing components, +(ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) +Semantic Editing. Subsequently, the Gaussian splats are transformed into +InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to +further add details and extract the geometry in the second stage. Our +experiments demonstrate that Interactive3D markedly improves the +controllability and quality of 3D generation. Our project webpage is available +at \url{https://interactive-3d.github.io/}.",cs.GR,"['cs.GR', 'cs.CV']" +Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models,Matthew Kowal · Richard P. Wildes · Kosta Derpanis,https://yorkucvil.github.io/VCC/,https://arxiv.org/abs/2404.02233,,2404.02233.pdf,Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models,"Understanding what deep network models capture in their learned +representations is a fundamental challenge in computer vision. We present a new +methodology to understanding such vision models, the Visual Concept Connectome +(VCC), which discovers human interpretable concepts and their interlayer +connections in a fully unsupervised manner. Our approach simultaneously reveals +fine-grained concepts at a layer, connection weightings across all layers and +is amendable to global analysis of network structure (e.g., branching pattern +of hierarchical concept assemblies). Previous work yielded ways to extract +interpretable concepts from single layers and examine their impact on +classification, but did not afford multilayer concept analysis across an entire +network architecture. Quantitative and qualitative empirical results show the +effectiveness of VCCs in the domain of image classification. Also, we leverage +VCCs for the application of failure mode debugging to reveal where mistakes +arise in deep networks.",cs.CV,['cs.CV'] +GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding,Hao Li · Dingwen Zhang · Yalun Dai · Nian Liu · Lechao Cheng · Li Jingfeng · Jingdong Wang · Junwei Han, ,https://arxiv.org/abs/2311.11863,,2311.11863.pdf,GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding,"Applying NeRF to downstream perception tasks for scene understanding and +representation is becoming increasingly popular. Most existing methods treat +semantic prediction as an additional rendering task, \textit{i.e.}, the ""label +rendering"" task, to build semantic NeRFs. However, by rendering +semantic/instance labels per pixel without considering the contextual +information of the rendered image, these methods usually suffer from unclear +boundary segmentation and abnormal segmentation of pixels within an object. To +solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel +pipeline that makes the widely used segmentation model and NeRF work compatibly +under a unified framework, for facilitating context-aware 3D scene perception. +To accomplish this goal, we introduce transformers to aggregate radiance as +well as semantic embedding fields jointly for novel views and facilitate the +joint volumetric rendering of both fields. 
In addition, we propose two +self-distillation mechanisms, i.e., the Semantic Distill Loss and the +Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality +of the semantic field and the maintenance of geometric consistency. In +evaluation, we conduct experimental comparisons under two perception tasks +(\textit{i.e.} semantic and instance segmentation) using both synthetic and +real-world datasets. Notably, our method outperforms SOTA approaches by 6.94\%, +11.76\%, and 8.47\% on generalized semantic segmentation, finetuning semantic +segmentation, and instance segmentation, respectively.",cs.CV,['cs.CV'] +Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception,Lei Fan · Mingfu Liang · Yunxuan Li · Gang Hua · Ying Wu, ,https://arxiv.org/abs/2311.13793,,2311.13793.pdf,Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception,"Active recognition enables robots to intelligently explore novel +observations, thereby acquiring more information while circumventing undesired +viewing conditions. Recent approaches favor learning policies from simulated or +collected data, wherein appropriate actions are more frequently selected when +the recognition is accurate. However, most recognition modules are developed +under the closed-world assumption, which makes them ill-equipped to handle +unexpected inputs, such as the absence of the target object in the current +observation. To address this issue, we propose treating active recognition as a +sequential evidence-gathering process, providing by-step uncertainty +quantification and reliable prediction under the evidence combination theory. +Additionally, the reward function developed in this paper effectively +characterizes the merit of actions when operating in open-world environments. +To evaluate the performance, we collect a dataset from an indoor simulator, +encompassing various recognition challenges such as distance, occlusion levels, +and visibility. Through a series of experiments on recognition and robustness +analysis, we demonstrate the necessity of introducing uncertainties to active +recognition and the superior performance of the proposed method.",cs.CV,"['cs.CV', 'cs.RO']" +CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition,Qixuan Zheng · Ming Zhang · Hong Yan, ,https://arxiv.org/abs/2402.16594,,2402.16594.pdf,CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition,"To achieve greater accuracy, hypergraph matching algorithms require +exponential increases in computational resources. Recent kd-tree-based +approximate nearest neighbor (ANN) methods, despite the sparsity of their +compatibility tensor, still require exhaustive calculations for large-scale +graph matching. This work utilizes CUR tensor decomposition and introduces a +novel cascaded second and third-order hypergraph matching framework (CURSOR) +for efficient hypergraph matching. A CUR-based second-order graph matching +algorithm is used to provide a rough match, and then the core of CURSOR, a +fiber-CUR-based tensor generation method, directly calculates entries of the +compatibility tensor by leveraging the initial second-order match result. This +significantly decreases the time complexity and tensor density. A probability +relaxation labeling (PRL)-based matching algorithm, especially suitable for +sparse tensors, is developed. 
Experiment results on large-scale synthetic +datasets and widely-adopted benchmark sets demonstrate the superiority of +CURSOR over existing methods. The tensor generation method in CURSOR can be +integrated seamlessly into existing hypergraph matching methods to improve +their performance and lower their computational costs.",cs.CV,['cs.CV'] +Total Selfie: Generating Full-Body Selfies,Bowei Chen · Brian Curless · Ira Kemelmacher-Shlizerman · Steve Seitz, ,https://arxiv.org/abs/2308.14740,,2308.14740.pdf,Total Selfie: Generating Full-Body Selfies,"We present a method to generate full-body selfies from photographs originally +taken at arms length. Because self-captured photos are typically taken close +up, they have limited field of view and exaggerated perspective that distorts +facial shapes. We instead seek to generate the photo some one else would take +of you from a few feet away. Our approach takes as input four selfies of your +face and body, a background image, and generates a full-body selfie in a +desired target pose. We introduce a novel diffusion-based approach to combine +all of this information into high-quality, well-composed photos of you with the +desired pose and background.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Novel View Synthesis with View-Dependent Effects from a Single Image,Juan Luis Gonzalez Bello · Munchurl Kim,https://kaist-viclab.github.io/monovde-site/,https://arxiv.org/abs/2312.08071v1,,2312.08071v1.pdf,Novel View Synthesis with View-Dependent Effects from a Single Image,"In this paper, we firstly consider view-dependent effects into single +image-based novel view synthesis (NVS) problems. For this, we propose to +exploit the camera motion priors in NVS to model view-dependent appearance or +effects (VDE) as the negative disparity in the scene. By recognizing +specularities ""follow"" the camera motion, we infuse VDEs into the input images +by aggregating input pixel colors along the negative depth region of the +epipolar lines. Also, we propose a `relaxed volumetric rendering' approximation +that allows computing the densities in a single pass, improving efficiency for +NVS from single images. Our method can learn single-image NVS from image +sequences only, which is a completely self-supervised learning method, for the +first time requiring neither depth nor camera pose annotations. We present +extensive experiment results and show that our proposed method can learn NVS +with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k +and MannequinChallenge datasets.",cs.CV,"['cs.CV', 'eess.IV']" +An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing,Feiran Hu · Chenlin Zhang · Jiangliang GUO · Xiu-Shen Wei · Lin Zhao · Anqi Xu · Lingyan Gao, ,,https://link.springer.com/article/10.1007/s11263-024-02009-7,,,,,nan +TRINS: Towards Multimodal Language Models That Can Read,Ruiyi Zhang · Yanzhe Zhang · Jian Chen · Yufan Zhou · Jiuxiang Gu · Changyou Chen · Tong Sun, ,https://arxiv.org/html/2401.10005v1,,2401.10005v1.pdf,Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation,"The increasing demand for intelligent systems capable of interpreting and +reasoning about visual content requires the development of Large Multi-Modal +Models (LMMs) that are not only accurate but also have explicit reasoning +capabilities. This paper presents a novel approach to imbue an LMM with the +ability to conduct explicit reasoning based on visual content and textual +instructions. 
We introduce a system that can ask a question to acquire +necessary knowledge, thereby enhancing the robustness and explicability of the +reasoning process. Our method comprises the development of a novel dataset +generated by a Large Language Model (LLM), designed to promote chain-of-thought +reasoning combined with a question-asking mechanism. We designed an LMM, which +has high capabilities on region awareness to address the intricate requirements +of image-text alignment. The model undergoes a three-stage training phase, +starting with large-scale image-text alignment using a large-scale datasets, +followed by instruction tuning, and fine-tuning with a focus on +chain-of-thought reasoning. The results demonstrate a stride toward a more +robust, accurate, and interpretable LMM, capable of reasoning explicitly and +seeking information proactively when confronted with ambiguous visual input.",cs.CV,"['cs.CV', 'cs.CL']" +DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors,Biwen Lei · Kai Yu · Mengyang Feng · Miaomiao Cui · Xuansong Xie, ,https://arxiv.org/abs/2312.16837,,2312.16837.pdf,DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors,"Text-guided domain adaptation and generation of 3D-aware portraits find many +applications in various fields. However, due to the lack of training data and +the challenges in handling the high variety of geometry and appearance, the +existing methods for these tasks suffer from issues like inflexibility, +instability, and low fidelity. In this paper, we propose a novel framework +DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by +combining 3D GANs and diffusion priors. Specifically, we integrate the +pre-trained 3D generative models (e.g., EG3D) and text-to-image diffusion +models. The former provides a strong foundation for stable and high-quality +avatar generation from text. And the diffusion models in turn offer powerful +priors and guide the 3D generator finetuning with informative direction to +achieve flexible and efficient text-guided domain adaptation. To enhance the +diversity in domain adaptation and the generation capability in text-to-avatar, +we introduce the relative distance loss and case-specific learnable triplane +respectively. Besides, we design a progressive texture refinement module to +improve the texture quality for both tasks above. Extensive experiments +demonstrate that the proposed framework achieves excellent results in both +domain adaptation and text-to-avatar tasks, outperforming existing methods in +terms of generation quality and efficiency. The project homepage is at +https://younglbw.github.io/DiffusionGAN3D-homepage/.",cs.CV,['cs.CV'] +Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions,Saeed Khorram · Mingqi Jiang · Mohamad Shahbazi · Mohamad Hosein Danesh · Li Fuxin, ,https://arxiv.org/abs/2402.17065,,2402.17065.pdf,Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions,"Despite the extensive research on training generative adversarial networks +(GANs) with limited training data, learning to generate images from long-tailed +training distributions remains fairly unexplored. 
In the presence of imbalanced +multi-class training data, GANs tend to favor classes with more samples, +leading to the generation of low-quality and less diverse samples in tail +classes. In this study, we aim to improve the training of class-conditional +GANs with long-tailed data. We propose a straightforward yet effective method +for knowledge sharing, allowing tail classes to borrow from the rich +information from classes with more abundant training data. More concretely, we +propose modifications to existing class-conditional GAN architectures to ensure +that the lower-resolution layers of the generator are trained entirely +unconditionally while reserving class-conditional generation for the +higher-resolution layers. Experiments on several long-tail benchmarks and GAN +architectures demonstrate a significant improvement over existing methods in +both the diversity and fidelity of the generated images. The code is available +at https://github.com/khorrams/utlo.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection,Trevine Oorloff · Surya Koppisetti · Nicolo Bonettini · Divyaraj Solanki · Ben Colman · Yaser Yacoob · Ali Shahriyari · Gaurav Bharaj, ,https://arxiv.org/abs/2310.03827,,2310.03827.pdf,Integrating Audio-Visual Features for Multimodal Deepfake Detection,"Deepfakes are AI-generated media in which an image or video has been +digitally modified. The advancements made in deepfake technology have led to +privacy and security issues. Most deepfake detection techniques rely on the +detection of a single modality. Existing methods for audio-visual detection do +not always surpass that of the analysis based on single modalities. Therefore, +this paper proposes an audio-visual-based method for deepfake detection, which +integrates fine-grained deepfake identification with binary classification. We +categorize the samples into four types by combining labels specific to each +single modality. This method enhances the detection under intra-domain and +cross-domain testing.",cs.CV,['cs.CV'] +Masked AutoDecoder is Effective Multi-Task Vision Generalist,Han Qiu · Jiaxing Huang · Peng Gao · Lewei Lu · Xiaoqin Zhang · Shijian Lu, ,https://arxiv.org/abs/2403.07692,,2403.07692.pdf,Masked AutoDecoder is Effective Multi-Task Vision Generalist,"Inspired by the success of general-purpose models in NLP, recent studies +attempt to unify different vision tasks in the same sequence format and employ +autoregressive Transformers for sequence prediction. They apply uni-directional +attention to capture sequential dependencies and generate task sequences +recursively. However, such autoregressive Transformers may not fit vision tasks +well, as vision task sequences usually lack the sequential dependencies +typically observed in natural languages. In this work, we design Masked +AutoDecoder~(MAD), an effective multi-task vision generalist. MAD consists of +two core designs. First, we develop a parallel decoding framework that +introduces bi-directional attention to capture contextual dependencies +comprehensively and decode vision task sequences in parallel. Second, we design +a masked sequence modeling approach that learns rich task contexts by masking +and reconstructing task sequences. In this way, MAD handles all the tasks by a +single network branch and a simple cross-entropy loss with minimal +task-specific designs. Extensive experiments demonstrate the great potential of +MAD as a new paradigm for unifying various vision tasks. 
MAD achieves superior +performance and inference efficiency compared to autoregressive counterparts +while obtaining competitive accuracy with task-specific models. Code will be +released.",cs.CV,['cs.CV'] +HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation,Yongliang Lin · Yongzhi Su · Praveen Nathan · Sandeep Inuganti · Yan Di · Martin Sundermeyer · Fabian Manhardt · Didier Stricker · Jason Rambach · Yu Zhang, ,https://arxiv.org/abs/2311.12588,,2311.12588.pdf,HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation,"In this work, we present a novel dense-correspondence method for 6DoF object +pose estimation from a single RGB-D image. While many existing data-driven +methods achieve impressive performance, they tend to be time-consuming due to +their reliance on rendering-based refinement approaches. To circumvent this +limitation, we present HiPose, which establishes 3D-3D correspondences in a +coarse-to-fine manner with a hierarchical binary surface encoding. Unlike +previous dense-correspondence methods, we estimate the correspondence surface +by employing point-to-surface matching and iteratively constricting the surface +until it becomes a correspondence point while gradually removing outliers. +Extensive experiments on public benchmarks LM-O, YCB-V, and T-Less demonstrate +that our method surpasses all refinement-free methods and is even on par with +expensive refinement-based approaches. Crucially, our approach is +computationally efficient and enables real-time critical applications with high +accuracy requirements.",cs.CV,['cs.CV'] +Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D,Karran Pandey · Paul Guerrero · Matheus Gadelha · Yannick Hold-Geoffroy · Karan Singh · Niloy J. Mitra,https://diffusionhandles.github.io/,https://arxiv.org/abs/2312.02190,,2312.02190.pdf,Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D,"Diffusion Handles is a novel approach to enabling 3D object edits on +diffusion images. We accomplish these edits using existing pre-trained +diffusion models, and 2D image depth estimation, without any fine-tuning or 3D +object retrieval. The edited results remain plausible, photo-real, and preserve +object identity. Diffusion Handles address a critically missing facet of +generative image based creative design, and significantly advance the +state-of-the-art in generative image editing. Our key insight is to lift +diffusion activations for an object to 3D using a proxy depth, 3D-transform the +depth and associated activations, and project them back to image space. The +diffusion process applied to the manipulated activations with identity control, +produces plausible edited images showing complex 3D occlusion and lighting +effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic +data benchmark; and qualitatively by a user study, showing our output to be +more plausible, and better than prior art at both, 3D editing and identity +control. 
Project Webpage: https://diffusionhandles.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective,Shunsuke Yasuki · Masato Taki, ,,https://github.com/snskysk/CAM-Back-Again,,,,,nan +A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,Julio Silva-Rodríguez · Sina Hajimiri · Ismail Ben Ayed · Jose Dolz,https://jusiro.github.io/projects/clap,https://arxiv.org/abs/2312.12730,,2312.12730.pdf,A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,"Efficient transfer learning (ETL) is receiving increasing attention to adapt +large pre-trained language-vision models on downstream tasks with a few labeled +samples. While significant progress has been made, we reveal that +state-of-the-art ETL approaches exhibit strong performance only in +narrowly-defined experimental setups, and with a careful adjustment of +hyperparameters based on a large corpus of labeled samples. In particular, we +make two interesting, and surprising empirical observations. First, to +outperform a simple Linear Probing baseline, these methods require to optimize +their hyper-parameters on each target task. And second, they typically +underperform -- sometimes dramatically -- standard zero-shot predictions in the +presence of distributional drifts. Motivated by the unrealistic assumptions +made in the existing literature, i.e., access to a large validation set and +case-specific grid-search for optimal hyperparameters, we propose a novel +approach that meets the requirements of real-world scenarios. More concretely, +we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing +term is optimized via an adaptation of the general Augmented Lagrangian method +tailored to this context. We comprehensively evaluate CLAP on a broad span of +datasets and scenarios, demonstrating that it consistently outperforms SoTA +approaches, while yet being a much more efficient alternative.",cs.CV,['cs.CV'] +DiaLoc: An Iterative Approach to Embodied Dialog Localization,Chao Zhang · Mohan Li · Ignas Budvytis · Stephan Liwicki, ,https://arxiv.org/abs/2403.06846,,2403.06846.pdf,DiaLoc: An Iterative Approach to Embodied Dialog Localization,"Multimodal learning has advanced the performance for many vision-language +tasks. However, most existing works in embodied dialog research focus on +navigation and leave the localization task understudied. The few existing +dialog-based localization approaches assume the availability of entire dialog +prior to localizaiton, which is impractical for deployed dialog-based +localization. In this paper, we propose DiaLoc, a new dialog-based localization +framework which aligns with a real human operator behavior. Specifically, we +produce an iterative refinement of location predictions which can visualize +current pose believes after each dialog turn. DiaLoc effectively utilizes the +multimodal data for multi-shot localization, where a fusion encoder fuses +vision and dialog information iteratively. We achieve state-of-the-art results +on embodied dialog-based localization task, in single-shot (+7.08% in +Acc5@valUnseen) and multi-shot settings (+10.85% in Acc5@valUnseen). DiaLoc +narrows the gap between simulation and real-world applications, opening doors +for future research on collaborative localization and navigation.",cs.CV,['cs.CV'] +De-Diffusion Makes Text a Strong Cross-Modal Interface,Chen Wei · Chenxi Liu · Siyuan Qiao · Zhishuai Zhang · Alan L. 
Yuille · Jiahui Yu, ,https://arxiv.org/abs/2311.00618,,2311.00618.pdf,De-Diffusion Makes Text a Strong Cross-Modal Interface,"We demonstrate text as a strong cross-modal interface. Rather than relying on +deep embeddings to connect image and language as the interface representation, +our approach represents an image as text, from which we enjoy the +interpretability and flexibility inherent to natural language. We employ an +autoencoder that uses a pre-trained text-to-image diffusion model for decoding. +The encoder is trained to transform an input image into text, which is then fed +into the fixed text-to-image diffusion decoder to reconstruct the original +input -- a process we term De-Diffusion. Experiments validate both the +precision and comprehensiveness of De-Diffusion text representing images, such +that it can be readily ingested by off-the-shelf text-to-image tools and LLMs +for diverse multi-modal tasks. For example, a single De-Diffusion model can +generalize to provide transferable prompts for different text-to-image tools, +and also achieves a new state of the art on open-ended vision-language tasks by +simply prompting large language models with few-shot examples.",cs.CV,['cs.CV'] +MMM: Generative Masked Motion Model,Ekkasit Pinyoanuntapong · Pu Wang · Minwoo Lee · Chen Chen, ,https://arxiv.org/abs/2312.03596,,2312.03596.pdf,MMM: Generative Masked Motion Model,"Recent advances in text-to-motion generation using diffusion and +autoregressive models have shown promising results. However, these models often +suffer from a trade-off between real-time performance, high fidelity, and +motion editability. To address this gap, we introduce MMM, a novel yet simple +motion generation paradigm based on Masked Motion Model. MMM consists of two +key components: (1) a motion tokenizer that transforms 3D human motion into a +sequence of discrete tokens in latent space, and (2) a conditional masked +motion transformer that learns to predict randomly masked motion tokens, +conditioned on the pre-computed text tokens. By attending to motion and text +tokens in all directions, MMM explicitly captures inherent dependency among +motion tokens and semantic mapping between motion and text tokens. During +inference, this allows parallel and iterative decoding of multiple motion +tokens that are highly consistent with fine-grained text descriptions, +therefore simultaneously achieving high-fidelity and high-speed motion +generation. In addition, MMM has innate motion editability. By simply placing +mask tokens in the place that needs editing, MMM automatically fills the gaps +while guaranteeing smooth transitions between editing and non-editing parts. +Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM +surpasses current leading methods in generating high-quality motion (evidenced +by superior FID scores of 0.08 and 0.429), while offering advanced editing +features such as body-part modification, motion in-betweening, and the +synthesis of long motion sequences. In addition, MMM is two orders of magnitude +faster on a single mid-range GPU than editable motion diffusion models. 
Our +project page is available at \url{https://exitudio.github.io/MMM-page}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +RMem: Restricted Memory Banks Improve Video Object Segmentation,Junbao Zhou · Ziqi Pang · Yu-Xiong Wang, ,https://arxiv.org/abs/2403.11529,,2403.11529.pdf,Video Object Segmentation with Dynamic Query Modulation,"Storing intermediate frame segmentations as memory for long-range context +modeling, spatial-temporal memory-based methods have recently showcased +impressive results in semi-supervised video object segmentation (SVOS). +However, these methods face two key limitations: 1) relying on non-local +pixel-level matching to read memory, resulting in noisy retrieved features for +segmentation; 2) segmenting each object independently without interaction. +These shortcomings make the memory-based methods struggle in similar object and +multi-object segmentation. To address these issues, we propose a query +modulation method, termed QMVOS. This method summarizes object features into +dynamic queries and then treats them as dynamic filters for mask prediction, +thereby providing high-level descriptions and object-level perception for the +model. Efficient and effective multi-object interactions are realized through +inter-query attention. Extensive experiments demonstrate that our method can +bring significant improvements to the memory-based SVOS method and achieve +competitive performance on standard SVOS benchmarks. The code is available at +https://github.com/zht8506/QMVOS.",cs.CV,['cs.CV'] +Neural Implicit Morphing of Face Images,Guilherme Schardong · Tiago Novello · Hallison Paz · Iurii Medvedev · Vinícius Silva · Luiz Velho · Nuno Gonçalves,https://schardong.github.io/ifmorph/,https://arxiv.org/abs/2308.13888,,2308.13888.pdf,Neural Implicit Morphing of Face Images,"Face morphing is a problem in computer graphics with numerous artistic and +forensic applications. It is challenging due to variations in pose, lighting, +gender, and ethnicity. This task consists of a warping for feature alignment +and a blending for a seamless transition between the warped images. We propose +to leverage coord-based neural networks to represent such warpings and +blendings of face images. During training, we exploit the smoothness and +flexibility of such networks by combining energy functionals employed in +classical approaches without discretizations. Additionally, our method is +time-dependent, allowing a continuous warping/blending of the images. During +morphing inference, we need both direct and inverse transformations of the +time-dependent warping. The first (second) is responsible for warping the +target (source) image into the source (target) image. Our neural warping stores +those maps in a single network dismissing the need for inverting them. The +results of our experiments indicate that our method is competitive with both +classical and generative models under the lens of image quality and +face-morphing detectors. 
Aesthetically, the resulting images present a seamless +blending of diverse faces not yet usual in the literature.",cs.CV,"['cs.CV', 'cs.LG', 'I.4.8; I.4.10']" +"Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action",Jiasen Lu · Christopher Clark · Sangho Lee · Zichen Zhang · Savya Khosla · Ryan Marten · Derek Hoiem · Aniruddha Kembhavi,https://unified-io-2.allenai.org/,https://arxiv.org/abs/2312.17172,,2312.17172.pdf,"Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action","We present Unified-IO 2, the first autoregressive multimodal model that is +capable of understanding and generating image, text, audio, and action. To +unify different modalities, we tokenize inputs and outputs -- images, text, +audio, action, bounding boxes, etc., into a shared semantic space and then +process them with a single encoder-decoder transformer model. Since training +with such diverse modalities is challenging, we propose various architectural +improvements to stabilize model training. We train our model from scratch on a +large multimodal pre-training corpus from diverse sources with a multimodal +mixture of denoisers objective. To learn an expansive set of skills, such as +following multimodal instructions, we construct and finetune on an ensemble of +120 datasets with prompts and augmentations. With a single unified model, +Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and +strong results in more than 35 benchmarks, including image generation and +understanding, natural language understanding, video and audio understanding, +and robotic manipulation. We release all our models to the research community.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SUGAR: Pre-training 3D Visual Representation for Robotics,Shizhe Chen · Ricardo Garcia Pinel · Ivan Laptev · Cordelia Schmid, ,https://arxiv.org/abs/2404.01491,,2404.01491.pdf,SUGAR: Pre-training 3D Visual Representations for Robotics,"Learning generalizable visual representations from Internet data has yielded +promising results for robotics. Yet, prevailing approaches focus on +pre-training 2D representations, being sub-optimal to deal with occlusions and +accurately localize objects in complex 3D scenes. Meanwhile, 3D representation +learning has been limited to single-object understanding. To address these +limitations, we introduce a novel 3D pre-training framework for robotics named +SUGAR that captures semantic, geometric and affordance properties of objects +through 3D point clouds. We underscore the importance of cluttered scenes in 3D +representation learning, and automatically construct a multi-object dataset +benefiting from cost-free supervision in simulation. SUGAR employs a versatile +transformer-based model to jointly address five pre-training tasks, namely +cross-modal knowledge distillation for semantic learning, masked point modeling +to understand geometry structures, grasping pose synthesis for object +affordance, 3D instance segmentation and referring expression grounding to +analyze cluttered scenes. We evaluate our learned representation on three +robotic-related tasks, namely, zero-shot 3D object recognition, referring +expression grounding, and language-driven robotic manipulation. 
Experimental +results show that SUGAR's 3D representation outperforms state-of-the-art 2D and +3D representations.",cs.CV,['cs.CV'] +GenN2N: Generative NeRF2NeRF Translation,Xiangyue Liu · Han Xue · Kunming Luo · Ping Tan · Li Yi, ,https://arxiv.org/abs/2404.02788,,2404.02788.pdf,GenN2N: Generative NeRF2NeRF Translation,"We present GenN2N, a unified NeRF-to-NeRF translation framework for various +NeRF translation tasks such as text-driven NeRF editing, colorization, +super-resolution, inpainting, etc. Unlike previous methods designed for +individual translation tasks with task-specific schemes, GenN2N achieves all +these NeRF editing tasks by employing a plug-and-play image-to-image translator +to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF +space. Since the 3D consistency of 2D edits may not be assured, we propose to +model the distribution of the underlying 3D edits through a generative model +that can cover all possible edited NeRFs. To model the distribution of 3D +edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes +images while decoding NeRFs. The latent space is trained to align with a +Gaussian distribution and the NeRFs are supervised through an adversarial loss +on its renderings. To ensure the latent code does not depend on 2D viewpoints +but truly reflects the 3D edits, we also regularize the latent code through a +contrastive learning scheme. Extensive experiments on various editing tasks +show GenN2N, as a universal framework, performs as well or better than +task-specific specialists while possessing flexible generative power. More +results on our project page: https://xiangyueliu.github.io/GenN2N/",cs.CV,['cs.CV'] +UniHuman: A Unified Model For Editing Human Images in the Wild,Nannan Li · Qing Liu · Krishna Kumar Singh · Yilin Wang · Jianming Zhang · Bryan A. Plummer · Zhe Lin, ,https://arxiv.org/abs/2312.14985,,2312.14985.pdf,UniHuman: A Unified Model for Editing Human Images in the Wild,"Human image editing includes tasks like changing a person's pose, their +clothing, or editing the image according to a text prompt. However, prior work +often tackles these tasks separately, overlooking the benefit of mutual +reinforcement from learning them jointly. In this paper, we propose UniHuman, a +unified model that addresses multiple facets of human image editing in +real-world settings. To enhance the model's generation quality and +generalization capacity, we leverage guidance from human visual encoders and +introduce a lightweight pose-warping module that can exploit different pose +representations, accommodating unseen textures and patterns. Furthermore, to +bridge the disparity between existing human editing benchmarks with real-world +data, we curated 400K high-quality human image-text pairs for training and +collected 2K human images for out-of-domain testing, both encompassing diverse +clothing styles, backgrounds, and age groups. Experiments on both in-domain and +out-of-domain test sets demonstrate that UniHuman outperforms task-specific +models by a significant margin. In user studies, UniHuman is preferred by the +users in an average of 77% of cases. 
Our project is available at +https://github.com/NannanLi999/UniHuman.",cs.CV,['cs.CV'] +Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers,Tsai-Shien Chen · Aliaksandr Siarohin · Willi Menapace · Ekaterina Deyneka · Hsiang-wei Chao · Byung Jeon · Yuwei Fang · Hsin-Ying Lee · Jian Ren · Ming-Hsuan Yang · Sergey Tulyakov,https://snap-research.github.io/Panda-70M/,https://arxiv.org/abs/2402.19479,,2402.19479.pdf,Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers,"The quality of the data and annotation upper-bounds the quality of a +downstream model. While there exist large text corpora and image-text pairs, +high-quality video-text data is much harder to collect. First of all, manual +labeling is more time-consuming, as it requires an annotator to watch an entire +video. Second, videos have a temporal dimension, consisting of several scenes +stacked together, and showing multiple actions. Accordingly, to establish a +video dataset with high-quality captions, we propose an automatic approach +leveraging multimodal inputs, such as textual video description, subtitles, and +individual video frames. Specifically, we curate 3.8M high-resolution videos +from the publicly available HD-VILA-100M dataset. We then split them into +semantically consistent video clips, and apply multiple cross-modality teacher +models to obtain captions for each video. Next, we finetune a retrieval model +on a small subset where the best caption of each video is manually selected and +then employ the model in the whole dataset to select the best caption as the +annotation. In this way, we get 70M videos paired with high-quality text +captions. We dub the dataset as Panda-70M. We show the value of the proposed +dataset on three downstream tasks: video captioning, video and text retrieval, +and text-driven video generation. The models trained on the proposed data score +substantially better on the majority of metrics across all the tasks.",cs.CV,['cs.CV'] +Personalized Residuals for Concept-Driven Text-to-Image Generation,Cusuh Ham · Matthew Fisher · James Hays · Nicholas Kolkin · Yuchen Liu · Richard Zhang · Tobias Hinz, ,https://arxiv.org/abs/2405.12978,,2405.12978.pdf,Personalized Residuals for Concept-Driven Text-to-Image Generation,"We present personalized residuals and localized attention-guided sampling for +efficient concept-driven generation using text-to-image diffusion models. Our +method first represents concepts by freezing the weights of a pretrained +text-conditioned diffusion model and learning low-rank residuals for a small +subset of the model's layers. The residual-based approach then directly enables +application of our proposed sampling technique, which applies the learned +residuals only in areas where the concept is localized via cross-attention and +applies the original diffusion weights in all other regions. Localized sampling +therefore combines the learned identity of the concept with the existing +generative prior of the underlying diffusion model. 
We show that personalized +residuals effectively capture the identity of a concept in ~3 minutes on a +single GPU without the use of regularization images and with fewer parameters +than previous models, and localized sampling allows using the original model as +strong prior for large parts of the image.",cs.CV,['cs.CV'] +SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection,Mingxuan Liu · Tyler Hayes · Elisa Ricci · Gabriela Csurka · Riccardo Volpi,https://github.com/naver/shine,https://arxiv.org/abs/2405.10053,,2405.10053.pdf,SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection,"Open-vocabulary object detection (OvOD) has transformed detection into a +language-guided task, empowering users to freely define their class +vocabularies of interest during inference. However, our initial investigation +indicates that existing OvOD detectors exhibit significant variability when +dealing with vocabularies across various semantic granularities, posing a +concern for real-world deployment. To this end, we introduce Semantic Hierarchy +Nexus (SHiNe), a novel classifier that uses semantic knowledge from class +hierarchies. It runs offline in three steps: i) it retrieves relevant +super-/sub-categories from a hierarchy for each target class; ii) it integrates +these categories into hierarchy-aware sentences; iii) it fuses these sentence +embeddings to generate the nexus classifier vector. Our evaluation on various +detection benchmarks demonstrates that SHiNe enhances robustness across diverse +vocabulary granularities, achieving up to +31.9% mAP50 with ground truth +hierarchies, while retaining improvements using hierarchies generated by large +language models. Moreover, when applied to open-vocabulary classification on +ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. +SHiNe is training-free and can be seamlessly integrated with any off-the-shelf +OvOD detector, without incurring additional computational overhead during +inference. The code is open source.",cs.CV,['cs.CV'] +Diff-BGM: A Diffusion Model for Video Background Music Generation,Sizhe Li · Yiming Qin · Minghang Zheng · Xin Jin · Yang Liu, ,http://export.arxiv.org/abs/2405.11913,,2405.11913.pdf,Diff-BGM: A Diffusion Model for Video Background Music Generation,"When editing a video, a piece of attractive background music is +indispensable. However, video background music generation tasks face several +challenges, for example, the lack of suitable training datasets, and the +difficulties in flexibly controlling the music generation process and +sequentially aligning the video and music. In this work, we first propose a +high-quality music-video dataset BGM909 with detailed annotation and shot +detection to provide multi-modal information about the video and music. We then +present evaluation metrics to assess music quality, including music diversity +and alignment between music and video with retrieval precision metrics. +Finally, we propose the Diff-BGM framework to automatically generate the +background music for a given video, which uses different signals to control +different aspects of the music during the generation process, i.e., uses +dynamic video features to control music rhythm and semantic features to control +the melody and atmosphere. We propose to align the video and music sequentially +by introducing a segment-aware cross-attention layer. Experiments verify the +effectiveness of our proposed method. 
The code and models are available at +https://github.com/sizhelee/Diff-BGM.",cs.CV,['cs.CV'] +Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization,Shaohan Li · Yunpeng Shi · Gilad Lerman, ,,https://www.semanticscholar.org/paper/Fully-distributed-synchronization-on-directed-via-Xia-Li/23d2c7b0150d90992f60c1d8a94d263beacb2bb0,,,,,nan +Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection,Suyeon Kim · Dongha Lee · SeongKu Kang · Sukang Chae · Sanghwan Jang · Hwanjo Yu, ,https://arxiv.org/abs/2405.19902,,2405.19902.pdf,Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection,"Label noise, commonly found in real-world datasets, has a detrimental impact +on a model's generalization. To effectively detect incorrectly labeled +instances, previous works have mostly relied on distinguishable training +signals, such as training loss, as indicators to differentiate between clean +and noisy labels. However, they have limitations in that the training signals +incompletely reveal the model's behavior and are not effectively generalized to +various noise types, resulting in limited detection accuracy. In this paper, we +propose DynaCor framework that distinguishes incorrectly labeled instances from +correctly labeled ones based on the dynamics of the training signals. To cope +with the absence of supervision for clean and noisy labels, DynaCor first +introduces a label corruption strategy that augments the original dataset with +intentionally corrupted labels, enabling indirect simulation of the model's +behavior on noisy labels. Then, DynaCor learns to identify clean and noisy +instances by inducing two clearly distinguishable clusters from the latent +representations of training dynamics. Our comprehensive experiments show that +DynaCor outperforms the state-of-the-art competitors and shows strong +robustness to various noise types and noise rates.",cs.LG,"['cs.LG', 'stat.ML']" +Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation,Qi Yang · Xing Nie · Tong Li · Gaopengfei · Ying Guo · Cheng Zhen · Pengfei Yan · Shiming Xiang, ,https://arxiv.org/abs/2312.06462,,2312.06462.pdf,Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation,"Recently, an audio-visual segmentation (AVS) task has been introduced, aiming +to group pixels with sounding objects within a given video. This task +necessitates a first-ever audio-driven pixel-level understanding of the scene, +posing significant challenges. In this paper, we propose an innovative +audio-visual transformer framework, termed COMBO, an acronym for COoperation of +Multi-order Bilateral relatiOns. For the first time, our framework explores +three types of bilateral entanglements within AVS: pixel entanglement, modality +entanglement, and temporal entanglement. Regarding pixel entanglement, we +employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate +more precise visual features from the foundational model. For modality +entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to +align corresponding visual and auditory signals bi-directionally. As for +temporal entanglement, we introduce an innovative adaptive inter-frame +consistency loss according to the inherent rules of temporal. 
Comprehensive +experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIou +on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that +COMBO surpasses previous state-of-the-art methods. Code and more results will +be publicly available at https://yannqi.github.io/AVS-COMBO/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.SD', 'eess.AS']" +EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars,Nikita Drobyshev · Antoni Bigata Casademunt · Konstantinos Vougioukas · Zoe Landgraf · Stavros Petridis · Maja Pantic, ,https://arxiv.org/abs/2404.19110,,2404.19110.pdf,EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars,"Head avatars animated by visual signals have gained popularity, particularly +in cross-driving synthesis where the driver differs from the animated +character, a challenging but highly practical approach. The recently presented +MegaPortraits model has demonstrated state-of-the-art results in this domain. +We conduct a deep examination and evaluation of this model, with a particular +focus on its latent space for facial expression descriptors, and uncover +several limitations with its ability to express intense face motions. To +address these limitations, we propose substantial changes in both training +pipeline and model architecture, to introduce our EMOPortraits model, where we: + Enhance the model's capability to faithfully support intense, asymmetric face +expressions, setting a new state-of-the-art result in the emotion transfer +task, surpassing previous methods in both metrics and quality. + Incorporate speech-driven mode to our model, achieving top-tier performance +in audio-driven facial animation, making it possible to drive source identity +through diverse modalities, including visual signal, audio, or a blend of both. + We propose a novel multi-view video dataset featuring a wide range of intense +and asymmetric facial expressions, filling the gap with absence of such data in +existing datasets.",cs.CV,['cs.CV'] +NC-TTT: A Noise Constrastive Approach for Test-Time Training,David OSOWIECHI · Gustavo Vargas Hakim · Mehrdad Noori · Milad Cheraghalikhani · Ali Bahri · Moslem Yazdanpanah · Ismail Ben Ayed · Christian Desrosiers, ,https://arxiv.org/abs/2404.08392,,2404.08392.pdf,NC-TTT: A Noise Contrastive Approach for Test-Time Training,"Despite their exceptional performance in vision tasks, deep learning models +often struggle when faced with domain shifts during testing. Test-Time Training +(TTT) methods have recently gained popularity by their ability to enhance the +robustness of models through the addition of an auxiliary objective that is +jointly optimized with the main task. Being strictly unsupervised, this +auxiliary objective is used at test time to adapt the model without any access +to labels. In this work, we propose Noise-Contrastive Test-Time Training +(NC-TTT), a novel unsupervised TTT technique based on the discrimination of +noisy feature maps. By learning to classify noisy views of projected feature +maps, and then adapting the model accordingly on new domains, classification +performance can be recovered by an important margin. Experiments on several +popular test-time adaptation baselines demonstrate the advantages of our method +compared to recent approaches for this task. 
The code can be found +at:https://github.com/GustavoVargasHakim/NCTTT.git",cs.CV,"['cs.CV', 'cs.LG']" +Forecasting of 3D Whole-body Human Poses with Grasping Objects,yan haitao · Qiongjie Cui · Jiexin Xie · Shijie Guo, ,https://arxiv.org/abs/2312.11972,,2312.11972.pdf,Expressive Forecasting of 3D Whole-body Human Motions,"Human motion forecasting, with the goal of estimating future human behavior +over a period of time, is a fundamental task in many real-world applications. +However, existing works typically concentrate on predicting the major joints of +the human body without considering the delicate movements of the human hands. +In practical applications, hand gesture plays an important role in human +communication with the real world, and expresses the primary intention of human +beings. In this work, we are the first to formulate a whole-body human pose +forecasting task, which jointly predicts the future body and hand activities. +Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) +framework that aims to predict both coarse (body joints) and fine-grained +(gestures) activities collaboratively, enabling expressive and +cross-facilitated forecasting of 3D whole-body human motions. Specifically, our +model involves two key constituents: cross-context alignment (XCA) and +cross-context interaction (XCI). Considering the heterogeneous information +within the whole-body, XCA aims to align the latent features of various human +components, while XCI focuses on effectively capturing the context interaction +among the human components. We conduct extensive experiments on a +newly-introduced large-scale benchmark and achieve state-of-the-art +performance. The code is public for research purposes at +https://github.com/Dingpx/EAI.",cs.CV,['cs.CV'] +QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction,Ishak Ayad · Nicolas Larue · Mai K. Nguyen, ,https://arxiv.org/abs/2402.17951,,2402.17951.pdf,QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction,"Inverse problems span across diverse fields. In medical contexts, computed +tomography (CT) plays a crucial role in reconstructing a patient's internal +structure, presenting challenges due to artifacts caused by inherently +ill-posed inverse problems. Previous research advanced image quality via +post-processing and deep unrolling algorithms but faces challenges, such as +extended convergence times with ultra-sparse data. Despite enhancements, +resulting images often show significant artifacts, limiting their effectiveness +for real-world diagnostic applications. We aim to explore deep second-order +unrolling algorithms for solving imaging inverse problems, emphasizing their +faster convergence and lower time complexity compared to common first-order +methods like gradient descent. In this paper, we introduce QN-Mixer, an +algorithm based on the quasi-Newton approach. We use learned parameters through +the BFGS algorithm and introduce Incept-Mixer, an efficient neural architecture +that serves as a non-local regularization term, capturing long-range +dependencies within images. To address the computational demands typically +associated with quasi-Newton algorithms that require full Hessian matrix +computations, we present a memory-efficient alternative. Our approach +intelligently downsamples gradient information, significantly reducing +computational requirements while maintaining performance. 
The approach is +validated through experiments on the sparse-view CT problem, involving various +datasets and scanning protocols, and is compared with post-processing and deep +unrolling state-of-the-art approaches. Our method outperforms existing +approaches and achieves state-of-the-art performance in terms of SSIM and PSNR, +all while reducing the number of unrolling iterations required.",eess.IV,"['eess.IV', 'cs.CV']" +Class Tokens Infusion for Weakly Supervised Semantic Segmentation,Sung-Hoon Yoon · Hoyong Kwon · Hyeonseong Kim · Kuk-Jin Yoon, ,http://export.arxiv.org/abs/2308.03005,,2308.03005.pdf,MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation,"This paper proposes a novel transformer-based framework that aims to enhance +weakly supervised semantic segmentation (WSSS) by generating accurate +class-specific object localization maps as pseudo labels. Building upon the +observation that the attended regions of the one-class token in the standard +vision transformer can contribute to a class-agnostic localization map, we +explore the potential of the transformer model to capture class-specific +attention for class-discriminative object localization by learning multiple +class tokens. We introduce a Multi-Class Token transformer, which incorporates +multiple class tokens to enable class-aware interactions with the patch tokens. +To achieve this, we devise a class-aware training strategy that establishes a +one-to-one correspondence between the output class tokens and the ground-truth +class labels. Moreover, a Contrastive-Class-Token (CCT) module is proposed to +enhance the learning of discriminative class tokens, enabling the model to +better capture the unique characteristics and properties of each class. As a +result, class-discriminative object localization maps can be effectively +generated by leveraging the class-to-patch attentions associated with different +class tokens. To further refine these localization maps, we propose the +utilization of patch-level pairwise affinity derived from the patch-to-patch +transformer attention. Furthermore, the proposed framework seamlessly +complements the Class Activation Mapping (CAM) method, resulting in +significantly improved WSSS performance on the PASCAL VOC 2012 and MS COCO 2014 +datasets. These results underline the importance of the class token for WSSS.",cs.CV,['cs.CV'] +Dual-consistency Model Inversion for Non-exemplar Class Incremental Learning,Zihuan Qiu · Yi Xu · Fanman Meng · Hongliang Li · Linfeng Xu · Qingbo Wu, ,https://ar5iv.labs.arxiv.org/html/2303.10891,,2303.10891.pdf,Non-Exemplar Online Class-incremental Continual Learning via Dual-prototype Self-augment and Refinement,"This paper investigates a new, practical, but challenging problem named +Non-exemplar Online Class-incremental continual Learning (NO-CL), which aims to +preserve the discernibility of base classes without buffering data examples and +efficiently learn novel classes continuously in a single-pass (i.e., online) +data stream. The challenges of this task are mainly two-fold: (1) Both base and +novel classes suffer from severe catastrophic forgetting as no previous samples +are available for replay. (2) As the online data can only be observed once, +there is no way to fully re-train the whole model, e.g., re-calibrate the +decision boundaries via prototype alignment or feature distillation. 
In this +paper, we propose a novel Dual-prototype Self-augment and Refinement method +(DSR) for NO-CL problem, which consists of two strategies: 1) Dual class +prototypes: vanilla and high-dimensional prototypes are exploited to utilize +the pre-trained information and obtain robust quasi-orthogonal representations +rather than example buffers for both privacy preservation and memory reduction. +2) Self-augment and refinement: Instead of updating the whole network, we +optimize high-dimensional prototypes alternatively with the extra projection +module based on self-augment vanilla prototypes, through a bi-level +optimization problem. Extensive experiments demonstrate the effectiveness and +superiority of the proposed DSR in NO-CL.",cs.CV,['cs.CV'] +MoDE: CLIP Data Experts via Clustering,Jiawei Ma · Po-Yao Huang · Saining Xie · Shang-Wen Li · Luke Zettlemoyer · Shih-Fu Chang · Wen-tau Yih · Hu Xu,https://github.com/facebookresearch/MetaCLIP/tree/main/mode,https://arxiv.org/abs/2404.16030,,2404.16030.pdf,MoDE: CLIP Data Experts via Clustering,"The success of contrastive language-image pretraining (CLIP) relies on the +supervision from the pairing between images and captions, which tends to be +noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn +a system of CLIP data experts via clustering. Each data expert is trained on +one data cluster, being less sensitive to false negative noises in other +clusters. At inference time, we ensemble their outputs by applying weights +determined through the correlation between task metadata and cluster +conditions. To estimate the correlation precisely, the samples in one cluster +should be semantically similar, but the number of data experts should still be +reasonable for training and inference. As such, we consider the ontology in +human language and propose to use fine-grained cluster centers to represent +each data expert at a coarse-grained level. Experimental studies show that four +CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and +OpenCLIP on zero-shot image classification but with less ($<$35\%) training +cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly +include new data experts. The code is available at +https://github.com/facebookresearch/MetaCLIP/tree/main/mode.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +FSC: Few-point Shape Completion,Xianzu Wu · Xianfeng Wu · Tianyu Luan · Yajing Bai · Zhongyuan Lai · Junsong Yuan, ,https://arxiv.org/abs/2403.07359v4,,2403.07359v4.pdf,FSC: Few-point Shape Completion,"While previous studies have demonstrated successful 3D object shape +completion with a sufficient number of points, they often fail in scenarios +when a few points, e.g. tens of points, are observed. Surprisingly, via entropy +analysis, we find that even a few points, e.g. 64 points, could retain +substantial information to help recover the 3D shape of the object. To address +the challenge of shape completion with very sparse point clouds, we then +propose Few-point Shape Completion (FSC) model, which contains a novel +dual-branch feature extractor for handling extremely sparse inputs, coupled +with an extensive branch for maximal point utilization with a saliency branch +for dynamic importance assignment. This model is further bolstered by a +two-stage revision network that refines both the extracted features and the +decoder output, enhancing the detail and authenticity of the completed point +cloud. 
Our experiments demonstrate the feasibility of recovering 3D shapes from +a few points. The proposed Few-point Shape Completion (FSC) model outperforms +previous methods on both few-point inputs and many-point inputs, and shows good +generalizability to different object categories.",cs.CV,['cs.CV'] +Equivariant Multi-Modality Image Fusion,Zixiang Zhao · Haowen Bai · Jiangshe Zhang · Yulun Zhang · Kai Zhang · Shuang Xu · Dongdong Chen · Radu Timofte · Luc Van Gool, ,https://arxiv.org/abs/2402.02235,,2402.02235.pdf,Image Fusion via Vision-Language Model,"Image fusion integrates essential information from multiple source images +into a single composite, emphasizing the highlighting structure and textures, +and refining imperfect areas. Existing methods predominantly focus on +pixel-level and semantic visual features for recognition. However, they +insufficiently explore the deeper semantic information at a text-level beyond +vision. Therefore, we introduce a novel fusion paradigm named image Fusion via +vIsion-Language Model (FILM), for the first time, utilizing explicit textual +information in different source images to guide image fusion. In FILM, input +images are firstly processed to generate semantic prompts, which are then fed +into ChatGPT to obtain rich textual descriptions. These descriptions are fused +in the textual domain and guide the extraction of crucial visual features from +the source images through cross-attention, resulting in a deeper level of +contextual understanding directed by textual semantic information. The final +fused image is created by vision feature decoder. This paradigm achieves +satisfactory results in four image fusion tasks: infrared-visible, medical, +multi-exposure, and multi-focus image fusion. We also propose a vision-language +dataset containing ChatGPT-based paragraph descriptions for the ten image +fusion datasets in four fusion tasks, facilitating future research in +vision-language model-based image fusion. Code and dataset will be released.",cs.CV,['cs.CV'] +High-Quality Facial Geometry and Appearance Capture at Home,Yuxuan Han · Junfeng Lyu · Feng Xu,https://yxuhan.github.io/CoRA/index.html,https://arxiv.org/abs/2312.03442,,2312.03442.pdf,High-Quality Facial Geometry and Appearance Capture at Home,"Facial geometry and appearance capture have demonstrated tremendous success +in 3D scanning real humans in studios. Recent works propose to democratize this +technique while keeping the results high quality. However, they are still +inconvenient for daily usage. In addition, they focus on an easier problem of +only capturing facial skin. This paper proposes a novel method for high-quality +face capture, featuring an easy-to-use system and the capability to model the +complete face with skin, mouth interior, hair, and eyes. We reconstruct facial +geometry and appearance from a single co-located smartphone flashlight sequence +captured in a dim room where the flashlight is the dominant light source (e.g. +rooms with curtains or at night). To model the complete face, we propose a +novel hybrid representation to effectively model both eyes and other facial +regions, along with novel techniques to learn it from images. We apply a +combined lighting model to compactly represent real illuminations and exploit a +morphable face albedo model as a reflectance prior to disentangle diffuse and +specular. 
Experiments show that our method can capture high-quality 3D +relightable scans.",cs.CV,['cs.CV'] +Multi-Object Tracking in the Dark,Xinzhe Wang · Kang Ma · Qiankun Liu · Yunhao Zou · Ying Fu, ,https://arxiv.org/abs/2405.06600,,2405.06600.pdf,Multi-Object Tracking in the Dark,"Low-light scenes are prevalent in real-world applications (e.g. autonomous +driving and surveillance at night). Recently, multi-object tracking in various +practical use cases have received much attention, but multi-object tracking in +dark scenes is rarely considered. In this paper, we focus on multi-object +tracking in dark scenes. To address the lack of datasets, we first build a +Low-light Multi-Object Tracking (LMOT) dataset. LMOT provides well-aligned +low-light video pairs captured by our dual-camera system, and high-quality +multi-object tracking annotations for all videos. Then, we propose a low-light +multi-object tracking method, termed as LTrack. We introduce the adaptive +low-pass downsample module to enhance low-frequency components of images +outside the sensor noises. The degradation suppression learning strategy +enables the model to learn invariant information under noise disturbance and +image quality degradation. These components improve the robustness of +multi-object tracking in dark scenes. We conducted a comprehensive analysis of +our LMOT dataset and proposed LTrack. Experimental results demonstrate the +superiority of the proposed method and its competitiveness in real night +low-light scenes. Dataset and Code: https: //github.com/ying-fu/LMOT",cs.CV,['cs.CV'] +VideoCon: Robust Video-Language Alignment via Contrast Captions,Hritik Bansal · Yonatan Bitton · Idan Szpektor · Kai-Wei Chang · Aditya Grover, ,https://arxiv.org/abs/2311.10111,,2311.10111.pdf,VideoCon: Robust Video-Language Alignment via Contrast Captions,"Despite being (pre)trained on a massive amount of data, state-of-the-art +video-language alignment models are not robust to semantically-plausible +contrastive changes in the video captions. Our work addresses this by +identifying a broad spectrum of contrast misalignments, such as replacing +entities, actions, and flipping event order, which alignment models should be +robust against. To this end, we introduce the VideoCon, a video-language +alignment dataset constructed by a large language model that generates +plausible contrast video captions and explanations for differences between +original and contrast video captions. Then, a generative video-language model +is finetuned with VideoCon to assess video-language entailment and generate +explanations. Our VideoCon-based alignment model significantly outperforms +current models. It exhibits a 12-point increase in AUC for the video-language +alignment task on human-generated contrast captions. Finally, our model sets +new state of the art zero-shot performance in temporally-extensive +video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video +question answering (ATP-Hard). Moreover, our model shows superior performance +on novel videos and human-crafted captions and explanations. 
Our code and data +are available at https://github.com/Hritikbansal/videocon.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,Sicong Leng · Hang Zhang · Guanzheng Chen · Xin Li · Shijian Lu · Chunyan Miao · Lidong Bing, ,https://arxiv.org/abs/2311.16922,,2311.16922.pdf,Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,"Large Vision-Language Models (LVLMs) have advanced considerably, intertwining +visual recognition and language understanding to generate content that is not +only coherent but also contextually attuned. Despite their success, LVLMs still +suffer from the issue of object hallucinations, where models generate plausible +yet incorrect outputs that include objects that do not exist in the images. To +mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple +and training-free method that contrasts output distributions derived from +original and distorted visual inputs. The proposed VCD effectively reduces the +over-reliance on statistical bias and unimodal priors, two essential causes of +object hallucinations. This adjustment ensures the generated content is closely +grounded to visual inputs, resulting in contextually accurate outputs. Our +experiments show that VCD, without either additional training or the usage of +external tools, significantly mitigates the object hallucination issue across +different LVLM families. Beyond mitigating object hallucinations, VCD also +excels in general LVLM benchmarks, highlighting its wide-ranging applicability.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SHINOBI: SHape and Illumination using Neural Object decomposition via BRDF optimization and Inverse rendering from unconstrained Image collections,Andreas Engelhardt · Amit Raj · Mark Boss · Yunzhi Zhang · Abhishek Kar · Yuanzhen Li · Ricardo Martin-Brualla · Jonathan T. Barron · Deqing Sun · Hendrik Lensch · Varun Jampani, ,https://arxiv.org/abs/2401.10171,,2401.10171.pdf,SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild,"We present SHINOBI, an end-to-end framework for the reconstruction of shape, +material, and illumination from object images captured with varying lighting, +pose, and background. Inverse rendering of an object based on unconstrained +image collections is a long-standing challenge in computer vision and graphics +and requires a joint optimization over shape, radiance, and pose. We show that +an implicit shape representation based on a multi-resolution hash encoding +enables faster and robust shape reconstruction with joint camera alignment +optimization that outperforms prior work. Further, to enable the editing of +illumination and object reflectance (i.e. material) we jointly optimize BRDF +and illumination together with the object's shape. Our method is class-agnostic +and works on in-the-wild image collections of objects to produce relightable 3D +assets for several use cases such as AR/VR, movies, games, etc. Project page: +https://shinobi.aengelhardt.com Video: +https://www.youtube.com/watch?v=iFENQ6AcYd8&feature=youtu.be",cs.CV,"['cs.CV', 'cs.GR']" +HEAL-SWIN: A Vision Transformer On The Sphere,Oscar Carlsson · Jan E. 
Gerken · Hampus Linander · Heiner Spiess · Fredrik Ohlsson · Christoffer Petersson · Daniel Persson, ,https://arxiv.org/abs/2307.07313,,2307.07313.pdf,HEAL-SWIN: A Vision Transformer On The Sphere,"High-resolution wide-angle fisheye images are becoming more and more +important for robotics applications such as autonomous driving. However, using +ordinary convolutional neural networks or vision transformers on this data is +problematic due to projection and distortion losses introduced when projecting +to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, +which combines the highly uniform Hierarchical Equal Area iso-Latitude +Pixelation (HEALPix) grid used in astrophysics and cosmology with the +Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and +flexible model capable of training on high-resolution, distortion-free +spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used +to perform the patching and windowing operations of the SWIN transformer, +enabling the network to process spherical representations with minimal +computational overhead. We demonstrate the superior performance of our model on +both synthetic and real automotive datasets, as well as a selection of other +image datasets, for semantic segmentation, depth regression and classification +tasks. Our code is publicly available at +https://github.com/JanEGerken/HEAL-SWIN.",cs.CV,"['cs.CV', 'cs.LG']" +BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning,Siyuan Liang · Mingli Zhu · Aishan Liu · Baoyuan Wu · Xiaochun Cao · Ee-Chien Chang, ,https://arxiv.org/abs/2311.12075,,2311.12075.pdf,BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning,"Studying backdoor attacks is valuable for model copyright protection and +enhancing defenses. While existing backdoor attacks have successfully infected +multimodal contrastive learning models such as CLIP, they can be easily +countered by specialized backdoor defenses for MCL models. This paper reveals +the threats in this practical scenario that backdoor attacks can remain +effective even after defenses and introduces the \emph{\toolns} attack, which +is resistant to backdoor detection and model fine-tuning defenses. To achieve +this, we draw motivations from the perspective of the Bayesian rule and propose +a dual-embedding guided framework for backdoor attacks. Specifically, we ensure +that visual trigger patterns approximate the textual target semantics in the +embedding space, making it challenging to detect the subtle parameter +variations induced by backdoor learning on such natural trigger patterns. +Additionally, we optimize the visual trigger patterns to align the poisoned +samples with target vision features in order to hinder the backdoor unlearning +through clean fine-tuning. Extensive experiments demonstrate that our attack +significantly outperforms state-of-the-art baselines (+45.3% ASR) in the +presence of SoTA backdoor defenses, rendering these mitigation and detection +strategies virtually ineffective. Furthermore, our approach effectively attacks +some more rigorous scenarios like downstream tasks. 
We believe that this paper +raises awareness regarding the potential threats associated with the practical +application of multimodal contrastive learning and encourages the development +of more robust defense mechanisms.",cs.CV,['cs.CV'] +Flexible Depth Completion for Sparse and Varying Point Densities,Jinhyung Park · Yu-Jhe Li · Kris Kitani, ,https://arxiv.org/abs/2405.09342,,2405.09342.pdf,Progressive Depth Decoupling and Modulating for Flexible Depth Completion,"Image-guided depth completion aims at generating a dense depth map from +sparse LiDAR data and RGB image. Recent methods have shown promising +performance by reformulating it as a classification problem with two sub-tasks: +depth discretization and probability prediction. They divide the depth range +into several discrete depth values as depth categories, serving as priors for +scene depth distributions. However, previous depth discretization methods are +easy to be impacted by depth distribution variations across different scenes, +resulting in suboptimal scene depth distribution priors. To address the above +problem, we propose a progressive depth decoupling and modulating network, +which incrementally decouples the depth range into bins and adaptively +generates multi-scale dense depth maps in multiple stages. Specifically, we +first design a Bins Initializing Module (BIM) to construct the seed bins by +exploring the depth distribution information within a sparse depth map, +adapting variations of depth distribution. Then, we devise an incremental depth +decoupling branch to progressively refine the depth distribution information +from global to local. Meanwhile, an adaptive depth modulating branch is +developed to progressively improve the probability representation from +coarse-grained to fine-grained. And the bi-directional information interactions +are proposed to strengthen the information interaction between those two +branches (sub-tasks) for promoting information complementation in each branch. +Further, we introduce a multi-scale supervision mechanism to learn the depth +distribution information in latent features and enhance the adaptation +capability across different scenes. Experimental results on public datasets +demonstrate that our method outperforms the state-of-the-art methods. The code +will be open-sourced at [this https URL](https://github.com/Cisse-away/PDDM).",cs.CV,['cs.CV'] +Neural Fields as Distributions: Signal Processing Beyond Euclidean Space,Daniel Rebain · Soroosh Yazdani · Kwang Moo Yi · Andrea Tagliasacchi, ,https://arxiv.org/abs/2404.13024,,,BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,"Largely due to their implicit nature, neural fields lack a direct mechanism +for filtering, as Fourier analysis from discrete signal processing is not +directly applicable to these representations. Effective filtering of neural +fields is critical to enable level-of-detail processing in downstream +applications, and support operations that involve sampling the field on regular +grids (e.g. marching cubes). Existing methods that attempt to decompose neural +fields in the frequency domain either resort to heuristics or require extensive +modifications to the neural field architecture. We show that via a simple +modification, one can obtain neural fields that are low-pass filtered, and in +turn show how this can be exploited to obtain a frequency decomposition of the +entire signal. 
We demonstrate the validity of our technique by investigating +level-of-detail reconstruction, and showing how coarser representations can be +computed effectively.",cs.CV,"['cs.CV', 'eess.IV']" +Accelerating Neural Field Training via Soft Mining,Shakiba Kheradmand · Daniel Rebain · Gopal Sharma · Hossam Isack · Abhishek Kar · Andrea Tagliasacchi · Kwang Moo Yi, ,https://arxiv.org/abs/2312.00075,,2312.00075.pdf,Accelerating Neural Field Training via Soft Mining,"We present an approach to accelerate Neural Field training by efficiently +selecting sampling locations. While Neural Fields have recently become popular, +it is often trained by uniformly sampling the training domain, or through +handcrafted heuristics. We show that improved convergence and final training +quality can be achieved by a soft mining technique based on importance +sampling: rather than either considering or ignoring a pixel completely, we +weigh the corresponding loss by a scalar. To implement our idea we use Langevin +Monte-Carlo sampling. We show that by doing so, regions with higher error are +being selected more frequently, leading to more than 2x improvement in +convergence speed. The code and related resources for this study are publicly +available at https://ubc-vision.github.io/nf-soft-mining/.",cs.CV,['cs.CV'] +Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences,Minyoung Hwang · Luca Weihs · Chanwoo Park · Kimin Lee · Aniruddha Kembhavi · Kiana Ehsani, ,https://arxiv.org/abs/2312.09337,,2312.09337.pdf,Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences,"Customizing robotic behaviors to be aligned with diverse human preferences is +an underexplored challenge in the field of embodied AI. In this paper, we +present Promptable Behaviors, a novel framework that facilitates efficient +personalization of robotic agents to diverse human preferences in complex +environments. We use multi-objective reinforcement learning to train a single +policy adaptable to a broad spectrum of preferences. We introduce three +distinct methods to infer human preferences by leveraging different types of +interactions: (1) human demonstrations, (2) preference feedback on trajectory +comparisons, and (3) language instructions. We evaluate the proposed method in +personalized object-goal navigation and flee navigation tasks in ProcTHOR and +RoboTHOR, demonstrating the ability to prompt agent behaviors to satisfy human +preferences in various scenarios. Project page: +https://promptable-behaviors.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Step differences in instructional video,Tushar Nagarajan · Lorenzo Torresani, ,https://arxiv.org/abs/2404.16222,,2404.16222.pdf,Step Differences in Instructional Video,"Comparing a user video to a reference how-to video is a key requirement for +AR/VR technology delivering personalized assistance tailored to the user's +progress. However, current approaches for language-based assistance can only +answer questions about a single video. We propose an approach that first +automatically generates large amounts of visual instruction tuning data +involving pairs of videos from HowTo100M by leveraging existing step +annotations and accompanying narrations, and then trains a video-conditioned +language model to jointly reason across multiple raw videos. 
Our model achieves +state-of-the-art performance at identifying differences between video pairs and +ranking videos based on the severity of these differences, and shows promising +ability to perform general reasoning over multiple videos.",cs.CV,['cs.CV'] +LEMON: Learning 3D Human-Object Interaction Relation from 2D Images,Yuhang Yang · Wei Zhai · Hongchen Luo · Yang Cao · Zheng-Jun Zha,https://yyvhang.github.io/LEMON/,https://arxiv.org/abs/2312.08963,,2312.08963.pdf,LEMON: Learning 3D Human-Object Interaction Relation from 2D Images,"Learning 3D human-object interaction relation is pivotal to embodied AI and +interaction modeling. Most existing methods approach the goal by learning to +predict isolated interaction elements, e.g., human contact, object affordance, +and human-object spatial relation, primarily from the perspective of either the +human or the object. Which underexploit certain correlations between the +interaction counterparts (human and object), and struggle to address the +uncertainty in interactions. Actually, objects' functionalities potentially +affect humans' interaction intentions, which reveals what the interaction is. +Meanwhile, the interacting humans and objects exhibit matching geometric +structures, which presents how to interact. In light of this, we propose +harnessing these inherent correlations between interaction counterparts to +mitigate the uncertainty and jointly anticipate the above interaction elements +in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object +iNteraction relation), a unified model that mines interaction intentions of the +counterparts and employs curvatures to guide the extraction of geometric +correlations, combining them to anticipate the interaction elements. Besides, +the 3D Interaction Relation dataset (3DIR) is collected to serve as the test +bed for training and evaluation. Extensive experiments demonstrate the +superiority of LEMON over methods estimating each element in isolation.",cs.CV,['cs.CV'] +Physical Property Understanding from Language-Embedded Feature Fields,Albert J. Zhai · Yuan Shen · Emily Y. Chen · Gloria Wang · Xinlei Wang · Sheng Wang · Kaiyu Guan · Shenlong Wang, ,https://arxiv.org/abs/2404.04242,,2404.04242.pdf,Physical Property Understanding from Language-Embedded Feature Fields,"Can computers perceive the physical properties of objects solely through +vision? Research in cognitive science and vision science has shown that humans +excel at identifying materials and estimating their physical properties based +purely on visual appearance. In this paper, we present a novel approach for +dense prediction of the physical properties of objects using a collection of +images. Inspired by how humans reason about physics through vision, we leverage +large language models to propose candidate materials for each object. We then +construct a language-embedded point cloud and estimate the physical properties +of each 3D point using a zero-shot kernel regression approach. Our method is +accurate, annotation-free, and applicable to any object in the open world. 
+Experiments demonstrate the effectiveness of the proposed approach in various +physical property reasoning tasks, such as estimating the mass of common +objects, as well as other properties like friction and hardness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval,Fang Kaipeng · Jingkuan Song · Lianli Gao · Pengpeng Zeng · Zhi-Qi Cheng · Xiyao LI · Heng Tao Shen,https://github.com/fangkaipeng/ProS,https://arxiv.org/abs/2312.12478,,2312.12478.pdf,ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval,"The goal of Universal Cross-Domain Retrieval (UCDR) is to achieve robust +performance in generalized test scenarios, wherein data may belong to strictly +unknown domains and categories during training. Recently, pre-trained models +with prompt tuning have shown strong generalization capabilities and attained +noteworthy achievements in various downstream tasks, such as few-shot learning +and video-text retrieval. However, applying them directly to UCDR may not +sufficiently to handle both domain shift (i.e., adapting to unfamiliar domains) +and semantic shift (i.e., transferring to unknown categories). To this end, we +propose \textbf{Pro}mpting-to-\textbf{S}imulate (ProS), the first method to +apply prompt tuning for UCDR. ProS employs a two-step process to simulate +Content-aware Dynamic Prompts (CaDP) which can impact models to produce +generalized features for UCDR. Concretely, in Prompt Units Learning stage, we +introduce two Prompt Units to individually capture domain and semantic +knowledge in a mask-and-align way. Then, in Context-aware Simulator Learning +stage, we train a Content-aware Prompt Simulator under a simulated test +scenarios to produce the corresponding CaDP. Extensive experiments conducted on +three benchmark datasets show that our method achieves new state-of-the-art +performance without bringing excessive parameters. Our method is publicly +available at https://github.com/fangkaipeng/ProS.",cs.CV,['cs.CV'] +CoDi-2: Interleaved and In-Context Any-to-Any Generation,Zineng Tang · Ziyi Yang · MAHMOUD KHADEMI · Yang Liu · Chenguang Zhu · Mohit Bansal, ,https://arxiv.org/abs/2311.18775,,2311.18775.pdf,"CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation","We present CoDi-2, a versatile and interactive Multimodal Large Language +Model (MLLM) that can follow complex multimodal interleaved instructions, +conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any +input-output modality paradigm. By aligning modalities with language for both +encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not +only understand complex modality-interleaved instructions and in-context +examples, but also autoregressively generate grounded and coherent multimodal +outputs in the continuous feature space. To train CoDi-2, we build a +large-scale generation dataset encompassing in-context multimodal instructions +across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot +capabilities for multimodal generation, such as in-context learning, reasoning, +and compositionality of any-to-any modality generation through multi-round +interactive conversation. CoDi-2 surpasses previous domain-specific models on +tasks such as subject-driven image generation, vision transformation, and audio +editing. 
CoDi-2 signifies a substantial breakthrough in developing a +comprehensive multimodal foundation model adept at interpreting in-context +language-vision-audio interleaved instructions and producing multimodal +outputs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.SD', 'eess.AS']" +EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models,Jingyuan Yang · Jiawei Feng · Hui Huang, ,https://arxiv.org/abs/2401.04608,,2401.04608.pdf,EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models,"Recent years have witnessed remarkable progress in image generation task, +where users can create visually astonishing images with high-quality. However, +existing text-to-image diffusion models are proficient in generating concrete +concepts (dogs) but encounter challenges with more abstract ones (emotions). +Several efforts have been made to modify image emotions with color and style +adjustments, facing limitations in effectively conveying emotions with fixed +image contents. In this work, we introduce Emotional Image Content Generation +(EICG), a new task to generate semantic-clear and emotion-faithful images given +emotion categories. Specifically, we propose an emotion space and construct a +mapping network to align it with the powerful Contrastive Language-Image +Pre-training (CLIP) space, providing a concrete interpretation of abstract +emotions. Attribute loss and emotion confidence are further proposed to ensure +the semantic diversity and emotion fidelity of the generated images. Our method +outperforms the state-of-the-art text-to-image approaches both quantitatively +and qualitatively, where we derive three custom metrics, i.e., emotion +accuracy, semantic clarity and semantic diversity. In addition to generation, +our method can help emotion understanding and inspire emotional art design.",cs.CV,['cs.CV'] +Rapid 3D Model Generation with Intuitive 3D Input,Tianrun Chen · Chaotao Ding · Shangzhan Zhang · Chunan Yu · Ying Zang · Zejian Li · Sida Peng · Lingyun Sun, ,https://ar5iv.labs.arxiv.org/html/2309.13006,,2309.13006.pdf,Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches,"The rapid development of AR/VR brings tremendous demands for 3D content. +While the widely-used Computer-Aided Design (CAD) method requires a +time-consuming and labor-intensive modeling process, sketch-based 3D modeling +offers a potential solution as a natural form of computer-human interaction. +However, the sparsity and ambiguity of sketches make it challenging to generate +high-fidelity content reflecting creators' ideas. Precise drawing from multiple +views or strategic step-by-step drawings is often required to tackle the +challenge but is not friendly to novice users. In this work, we introduce a +novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only +a single free-hand sketch without inputting multiple sketches or view +information. Specifically, we introduce a lightweight generation network for +efficient inference in real-time and a structural-aware adversarial training +approach with a Stroke Enhancement Module (SEM) to capture the structural +information to facilitate learning of the realistic and fine-detailed shape +structures for high-fidelity performance. 
Extensive experiments demonstrated +the effectiveness of our approach with the state-of-the-art (SOTA) performance +on both synthetic and real datasets.",cs.CV,['cs.CV'] +L-MAGIC: Language Model Assisted Generation of Images with Consistency,zhipeng cai · Matthias Mueller · Reiner Birkl · Diana Wofk · Shao-Yen Tseng · JunDa Cheng · Gabriela Ben Melech Stan · Vasudev Lal · Michael Paulitsch, ,https://arxiv.org/abs/2311.16500,,2311.16500.pdf,LLMGA: Multimodal Large Language Model based Generation Assistant,"In this paper, we introduce a Multimodal Large Language Model-based +Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and +proficiency in reasoning, comprehension, and response inherent in Large +Language Models (LLMs) to assist users in image generation and editing. +Diverging from existing approaches where Multimodal Large Language Models +(MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our +LLMGA provides a detailed language generation prompt for precise control over +SD. This not only augments LLM context understanding but also reduces noise in +generation prompts, yields images with more intricate and precise content, and +elevates the interpretability of the network. To this end, we curate a +comprehensive dataset comprising prompt refinement, similar image generation, +inpainting \& outpainting, and instruction-based editing. Moreover, we propose +a two-stage training scheme. In the first stage, we train the MLLM to grasp the +properties of image generation and editing, enabling it to generate detailed +prompts. In the second stage, we optimize SD to align with the MLLM's +generation prompts. Additionally, we propose a reference-based restoration +network to alleviate texture, brightness, and contrast disparities between +generated and preserved regions during inpainting and outpainting. Extensive +results show that LLMGA has promising generation and editing capabilities and +can enable more flexible and expansive applications in an interactive manner.",cs.CV,['cs.CV'] +Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,Kim Youwang · Tae-Hyun Oh · Gerard Pons-Moll, ,https://arxiv.org/abs/2312.11360v1,,2312.11360v1.pdf,Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,"We present Paint-it, a text-driven high-fidelity texture map synthesis method +for 3D meshes via neural re-parameterized texture optimization. Paint-it +synthesizes texture maps from a text description by +synthesis-through-optimization, exploiting the Score-Distillation Sampling +(SDS). We observe that directly applying SDS yields undesirable texture quality +due to its noisy gradients. We reveal the importance of texture +parameterization when using SDS. Specifically, we propose Deep Convolutional +Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes +the physically-based rendering (PBR) texture maps with randomly initialized +convolution-based neural kernels, instead of a standard pixel-based +parameterization. We show that DC-PBR inherently schedules the optimization +curriculum according to texture frequency and naturally filters out the noisy +signals from SDS. In experiments, Paint-it obtains remarkable quality PBR +texture maps within 15 min., given only a text description. 
We demonstrate the +generalizability and practicality of Paint-it by synthesizing high-quality +texture maps for large-scale mesh datasets and showing test-time applications +such as relighting and material control using a popular graphics engine. +Project page: https://kim-youwang.github.io/paint-it",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation,Peng Lu · Tao Jiang · Yining Li · Xiangtai Li · Kai Chen · Wenming Yang,https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo,https://arxiv.org/abs/2312.07526,,2312.07526.pdf,RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation,"Real-time multi-person pose estimation presents significant challenges in +balancing speed and precision. While two-stage top-down methods slow down as +the number of people in the image increases, existing one-stage methods often +fail to simultaneously deliver high accuracy and real-time performance. This +paper introduces RTMO, a one-stage pose estimation framework that seamlessly +integrates coordinate classification by representing keypoints using dual 1-D +heatmaps within the YOLO architecture, achieving accuracy comparable to +top-down methods while maintaining high speed. We propose a dynamic coordinate +classifier and a tailored loss function for heatmap learning, specifically +designed to address the incompatibilities between coordinate classification and +dense prediction models. RTMO outperforms state-of-the-art one-stage pose +estimators, achieving 1.1% higher AP on COCO while operating about 9 times +faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on +COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and +accuracy. The code and models are available at +https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.",cs.CV,['cs.CV'] +Multi-Session SLAM using Wide-Baseline Optical Flow,Lahav Lipson · Jia Deng, ,https://arxiv.org/abs/2404.15263,,2404.15263.pdf,Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization,"We introduce a new system for Multi-Session SLAM, which tracks camera motion +across multiple disjoint videos under a single global reference. Our approach +couples the prediction of optical flow with solver layers to estimate camera +pose. The backbone is trained end-to-end using a novel differentiable solver +for wide-baseline two-view pose. The full system can connect disjoint +sequences, perform visual odometry, and global optimization. Compared to +existing approaches, our design is accurate and robust to catastrophic +failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose",cs.CV,['cs.CV'] +Action Scene Graphs for Long-Form Understanding of Egocentric Videos,Ivan Rodin · Antonino Furnari · Kyle Min · Subarna Tripathi · Giovanni Maria Farinella,https://github.com/fpv-iplab/easg,https://arxiv.org/abs/2312.03391,,2312.03391.pdf,Action Scene Graphs for Long-Form Understanding of Egocentric Videos,"We present Egocentric Action Scene Graphs (EASGs), a new representation for +long-form understanding of egocentric videos. EASGs extend standard +manually-annotated representations of egocentric videos, such as verb-noun +action labels, by providing a temporally evolving graph-based description of +the actions performed by the camera wearer, including interacted objects, their +relationships, and how actions unfold in time. 
Through a novel annotation +procedure, we extend the Ego4D dataset by adding manually labeled Egocentric +Action Scene Graphs offering a rich set of annotations designed for long-from +egocentric video understanding. We hence define the EASG generation task and +provide a baseline approach, establishing preliminary benchmarks. Experiments +on two downstream tasks, egocentric action anticipation and egocentric activity +summarization, highlight the effectiveness of EASGs for long-form egocentric +video understanding. We will release the dataset and the code to replicate +experiments and annotations.",cs.CV,['cs.CV'] +Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations,Sangmin Lee · Bolin Lai · Fiona Ryan · Bikram Boote · James Rehg,https://sangmin-git.github.io/projects/MMSI,https://arxiv.org/abs/2403.02090,,2403.02090.pdf,Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations,"Understanding social interactions involving both verbal and non-verbal cues +is essential for effectively interpreting social situations. However, most +prior works on multimodal social cues focus predominantly on single-person +behaviors or rely on holistic visual representations that are not aligned to +utterances in multi-party environments. Consequently, they are limited in +modeling the intricate dynamics of multi-party interactions. In this paper, we +introduce three new challenging tasks to model the fine-grained dynamics +between multiple people: speaking target identification, pronoun coreference +resolution, and mentioned player prediction. We contribute extensive data +annotations to curate these new challenges in social deduction game settings. +Furthermore, we propose a novel multimodal baseline that leverages densely +aligned language-visual representations by synchronizing visual features with +their corresponding utterances. This facilitates concurrently capturing verbal +and non-verbal cues pertinent to social reasoning. Experiments demonstrate the +effectiveness of the proposed approach with densely aligned multimodal +representations in modeling fine-grained social interactions. Project website: +https://sangmin-git.github.io/projects/MMSI.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Splatter Image: Ultra-Fast Single-View 3D Reconstruction,Stanislaw Szymanowicz · Christian Rupprecht · Andrea Vedaldi, ,https://arxiv.org/abs/2312.13150,,2312.13150.pdf,Splatter Image: Ultra-Fast Single-View 3D Reconstruction,"We introduce the \method, an ultra-efficient approach for monocular 3D object +reconstruction. Splatter Image is based on Gaussian Splatting, which allows +fast and high-quality reconstruction of 3D scenes from multiple images. We +apply Gaussian Splatting to monocular reconstruction by learning a neural +network that, at test time, performs reconstruction in a feed-forward manner, +at 38 FPS. Our main innovation is the surprisingly straightforward design of +this network, which, using 2D operators, maps the input image to one 3D +Gaussian per pixel. The resulting set of Gaussians thus has the form an image, +the Splatter Image. We further extend the method take several images as input +via cross-view attention. Owning to the speed of the renderer (588 FPS), we use +a single GPU for training while generating entire images at each iteration to +optimize perceptual metrics like LPIPS. 
On several synthetic, real, +multi-category and large-scale benchmark datasets, we achieve better results in +terms of PSNR, LPIPS, and other metrics while training and evaluating much +faster than prior works. Code, models, demo and more results are available at +https://szymanowiczs.github.io/splatter-image.",cs.CV,['cs.CV'] +MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception,Thien-Minh Nguyen · Shenghai Yuan · Thien Nguyen · Pengyu Yin · Haozhi Cao · Lihua Xie · Maciej Wozniak · Patric Jensfelt · Marko Thiel · Justin Ziegenbein · Noel Blunder, ,https://arxiv.org/abs/2403.11496,,2403.11496.pdf,MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception,"Perception plays a crucial role in various robot applications. However, +existing well-annotated datasets are biased towards autonomous driving +scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often +lack environment and domain variations. To expand the frontier of these fields, +we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), +featuring a wide range of sensing modalities, high-accuracy ground truth, and +diverse challenging environments across three Eurasian university campuses. MCD +comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive +Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and +UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce +semantic annotations of 29 classes over 59k sparse NRE lidar scans across three +domains, thus providing a novel challenge to existing semantic segmentation +research upon this largely unexplored lidar modality. Finally, we propose, for +the first time to the best of our knowledge, continuous-time ground truth based +on optimization-based registration of lidar-inertial data on large survey-grade +prior maps, which are also publicly released, each several times the size of +existing ones. We conduct a rigorous evaluation of numerous state-of-the-art +algorithms on MCD, report their performance, and highlight the challenges +awaiting solutions from the research community.",cs.RO,"['cs.RO', 'cs.AI']" +FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio,Chao Xu · Yang Liu · Jiazheng Xing · Weida Wang · Mingze Sun · Jun Dan · Tianxin Huang · Siyuan Li · Zhi-Qi Cheng · Ying Tai · Baigui Sun, ,https://arxiv.org/abs/2403.01901,,2403.01901.pdf,FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio,"In this paper, we abstract the process of people hearing speech, extracting +meaningful cues, and creating various dynamically audio-consistent talking +faces, termed Listening and Imagining, into the task of high-fidelity diverse +talking faces generation from a single audio. Specifically, it involves two +critical challenges: one is to effectively decouple identity, content, and +emotion from entangled audio, and the other is to maintain intra-video +diversity and inter-video consistency. To tackle the issues, we first dig out +the intricate relationships among facial factors and simplify the decoupling +process, tailoring a Progressive Audio Disentanglement for accurate facial +geometry and semantics learning, where each stage incorporates a customized +training module responsible for a specific factor. 
Secondly, to achieve +visually diverse and audio-synchronized animation solely from input audio +within a single model, we introduce the Controllable Coherent Frame generation, +which involves the flexible integration of three trainable adapters with frozen +Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and +semantics, as well as texture and temporal coherence between frames. In this +way, we inherit high-quality diverse generation from LDMs while significantly +improving their controllability at a low training cost. Extensive experiments +demonstrate the flexibility and effectiveness of our method in handling this +paradigm. The codes will be released at +https://github.com/modelscope/facechain.",cs.CV,['cs.CV'] +NeISF: Neural Incident Stokes Field for Geometry and Material Estimation,Chenhao Li · Taishi Ono · Takeshi Uemori · Hajime Mihara · Alexander Gatto · Hajime Nagahara · Yusuke Moriuchi, ,https://arxiv.org/abs/2311.13187v1,,2311.13187v1.pdf,NeISF: Neural Incident Stokes Field for Geometry and Material Estimation,"Multi-view inverse rendering is the problem of estimating the scene +parameters such as shapes, materials, or illuminations from a sequence of +images captured under different viewpoints. Many approaches, however, assume +single light bounce and thus fail to recover challenging scenarios like +inter-reflections. On the other hand, simply extending those methods to +consider multi-bounced light requires more assumptions to alleviate the +ambiguity. To address this problem, we propose Neural Incident Stokes Fields +(NeISF), a multi-view inverse rendering framework that reduces ambiguities +using polarization cues. The primary motivation for using polarization cues is +that it is the accumulation of multi-bounced light, providing rich information +about geometry and material. Based on this knowledge, the proposed incident +Stokes field efficiently models the accumulated polarization effect with the +aid of an original physically-based differentiable polarimetric renderer. +Lastly, experimental results show that our method outperforms the existing +works in synthetic and real scenarios.",cs.CV,['cs.CV'] +PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection,Kuan-Chih Huang · Weijie Lyu · Ming-Hsuan Yang · Yi-Hsuan Tsai, ,https://arxiv.org/abs/2312.08371,,2312.08371.pdf,PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection,"Recent temporal LiDAR-based 3D object detectors achieve promising performance +based on the two-stage proposal-based approach. They generate 3D box candidates +from the first-stage dense detector, followed by different temporal aggregation +methods. However, these approaches require per-frame objects or whole point +clouds, posing challenges related to memory bank utilization. Moreover, point +clouds and trajectory features are combined solely based on concatenation, +which may neglect effective interactions between them. In this paper, we +propose a point-trajectory transformer with long short-term memory for +efficient temporal 3D object detection. To this end, we only utilize point +clouds of current-frame objects and their historical trajectories as input to +minimize the memory bank storage requirement. Furthermore, we introduce modules +to encode trajectory features, focusing on long short-term and future-aware +perspectives, and then effectively aggregate them with point cloud features. 
We +conduct extensive experiments on the large-scale Waymo dataset to demonstrate +that our approach performs well against state-of-the-art methods. Code and +models will be made publicly available at https://github.com/kuanchihhuang/PTT.",cs.CV,['cs.CV'] +SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation,Chen Sichen · Yingyi Zhang · Siming Huang · Ran Yi · Ke Fan · Ruixin Zhang · Peixian Chen · Jun Wang · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2404.03518,,2404.03518.pdf,SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation,"Recently, transformer-based methods have achieved state-of-the-art prediction +quality on human pose estimation(HPE). Nonetheless, most of these +top-performing transformer-based models are too computation-consuming and +storage-demanding to deploy on edge computing platforms. Those +transformer-based models that require fewer resources are prone to +under-fitting due to their smaller scale and thus perform notably worse than +their larger counterparts. Given this conundrum, we introduce SDPose, a new +self-distillation method for improving the performance of small +transformer-based models. To mitigate the problem of under-fitting, we design a +transformer module named Multi-Cycled Transformer(MCT) based on multiple-cycled +forwards to more fully exploit the potential of small model parameters. +Further, in order to prevent the additional inference compute-consuming brought +by MCT, we introduce a self-distillation scheme, extracting the knowledge from +the MCT module to a naive forward model. Specifically, on the MSCOCO validation +dataset, SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs. +Furthermore, SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset +with 6.2M parameters and 4.7 GFLOPs, achieving a new state-of-the-art among +predominant tiny neural network methods. Our code is available at +https://github.com/MartyrPenink/SDPose.",cs.CV,['cs.CV'] +Uncertainty-aware Action Decoupling Transformer for Action Anticipation,Hongji Guo · Nakul Agarwal · Shao-Yuan Lo · Kwonjoon Lee · Qiang Ji, ,https://arxiv.org/abs/2309.16397,,2309.16397.pdf,Uncertainty-Aware Decision Transformer for Stochastic Driving Environments,"Offline Reinforcement Learning (RL) has emerged as a promising framework for +learning policies without active interactions, making it especially appealing +for autonomous driving tasks. Recent successes of Transformers inspire casting +offline RL as sequence modeling, which performs well in long-horizon tasks. +However, they are overly optimistic in stochastic environments with incorrect +assumptions that the same goal can be consistently achieved by identical +actions. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer +(UNREST) for planning in stochastic driving environments without introducing +additional transition or complex generative models. Specifically, UNREST +estimates state uncertainties by the conditional mutual information between +transitions and returns, and segments sequences accordingly. Discovering the +`uncertainty accumulation' and `temporal locality' properties of driving +environments, UNREST replaces the global returns in decision transformers with +less uncertain truncated returns, to learn from true outcomes of agent actions +rather than environment transitions. We also dynamically evaluate environmental +uncertainty during inference for cautious planning. 
Extensive experimental +results demonstrate UNREST's superior performance in various driving scenarios +and the power of our uncertainty estimation strategy.",cs.LG,"['cs.LG', 'cs.AI']" +Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM,Linyu Tang · Lei Zhang, ,https://arxiv.org/abs/2403.11448,,2403.11448.pdf,Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM,"Numerous studies have demonstrated the susceptibility of deep neural networks +(DNNs) to subtle adversarial perturbations, prompting the development of many +advanced adversarial defense methods aimed at mitigating adversarial attacks. +Current defense strategies usually train DNNs for a specific adversarial attack +method and can achieve good robustness in defense against this type of +adversarial attack. Nevertheless, when subjected to evaluations involving +unfamiliar attack modalities, empirical evidence reveals a pronounced +deterioration in the robustness of DNNs. Meanwhile, there is a trade-off +between the classification accuracy of clean examples and adversarial examples. +Most defense methods often sacrifice the accuracy of clean examples in order to +improve the adversarial robustness of DNNs. To alleviate these problems and +enhance the overall robust generalization of DNNs, we propose the Test-Time +Pixel-Level Adversarial Purification (TPAP) method. This approach is based on +the robust overfitting characteristic of DNNs to the fast gradient sign method +(FGSM) on training and test datasets. It utilizes FGSM for adversarial +purification, to process images for purifying unknown adversarial perturbations +from pixels at testing time in a ""counter changes with changelessness"" manner, +thereby enhancing the defense capability of DNNs against various unknown +adversarial attacks. Extensive experimental results show that our method can +effectively improve both overall robust generalization of DNNs, notably over +previous methods.",cs.CV,['cs.CV'] +Compositional Video Understanding with Spatiotemporal Structure-based Transformers,Hoyeoung Yun · Jinwoo Ahn · Minseo Kim · Eun-Sol Kim, ,https://arxiv.org/abs/2401.10831,,2401.10831.pdf,Understanding Video Transformers via Universal Concept Discovery,"This paper studies the problem of concept-based interpretability of +transformer representations for videos. Concretely, we seek to explain the +decision-making process of video transformers based on high-level, +spatiotemporal concepts that are automatically discovered. Prior research on +concept-based interpretability has concentrated solely on image-level tasks. +Comparatively, video models deal with the added temporal dimension, increasing +complexity and posing challenges in identifying dynamic concepts over time. In +this work, we systematically address these challenges by introducing the first +Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose +an efficient approach for unsupervised identification of units of video +transformer representations - concepts, and ranking their importance to the +output of a model. The resulting concepts are highly interpretable, revealing +spatio-temporal reasoning mechanisms and object-centric representations in +unstructured video models. Performing this analysis jointly over a diverse set +of supervised and self-supervised representations, we discover that some of +these mechanism are universal in video transformers. 
Finally, we show that VTCD +can be used for fine-grained action recognition and video object segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration,Mingyuan Meng · Dagan Feng · Lei Bi · Jinman Kim, ,https://arxiv.org/abs/2311.16707,,2311.16707.pdf,Full-resolution MLPs Empower Medical Dense Prediction,"Dense prediction is a fundamental requirement for many medical vision tasks +such as medical image restoration, registration, and segmentation. The most +popular vision model, Convolutional Neural Networks (CNNs), has reached +bottlenecks due to the intrinsic locality of convolution operations. Recently, +transformers have been widely adopted for dense prediction for their capability +to capture long-range visual dependence. However, due to the high computational +complexity and large memory consumption of self-attention operations, +transformers are usually used at downsampled feature resolutions. Such usage +cannot effectively leverage the tissue-level textural information available +only at the full image resolution. This textural information is crucial for +medical dense prediction as it can differentiate the subtle human anatomy in +medical images. In this study, we hypothesize that Multi-layer Perceptrons +(MLPs) are superior alternatives to transformers in medical dense prediction +where tissue-level details dominate the performance, as MLPs enable long-range +dependence at the full image resolution. To validate our hypothesis, we develop +a full-resolution hierarchical MLP framework that uses MLPs beginning from the +full image resolution. We evaluate this framework with various MLP blocks on a +wide range of medical dense prediction tasks including restoration, +registration, and segmentation. Extensive experiments on six public +well-benchmarked datasets show that, by simply using MLPs at full resolution, +our framework outperforms its CNN and transformer counterparts and achieves +state-of-the-art performance on various medical dense prediction tasks.",eess.IV,"['eess.IV', 'cs.CV']" +Multimodal Representation Learning by Alternating Unimodal Adaptation,Xiaohui Zhang · Xiaohui Zhang · Jaehong Yoon · Mohit Bansal · Huaxiu Yao, ,https://arxiv.org/abs/2311.10707,,2311.10707.pdf,Multimodal Representation Learning by Alternating Unimodal Adaptation,"Multimodal learning, which integrates data from diverse sensory modes, plays +a pivotal role in artificial intelligence. However, existing multimodal +learning methods often struggle with challenges where some modalities appear +more dominant than others during multimodal learning, resulting in suboptimal +performance. To address this challenge, we propose MLA (Multimodal Learning +with Alternating Unimodal Adaptation). MLA reframes the conventional joint +multimodal learning process by transforming it into an alternating unimodal +learning process, thereby minimizing interference between modalities. +Simultaneously, it captures cross-modal interactions through a shared head, +which undergoes continuous optimization across different modalities. This +optimization process is controlled by a gradient modification mechanism to +prevent the shared head from losing previously acquired information. During the +inference phase, MLA utilizes a test-time uncertainty-based model fusion +mechanism to integrate multimodal information. 
Extensive experiments are +conducted on five diverse datasets, encompassing scenarios with complete +modalities and scenarios with missing modalities. These experiments demonstrate +the superiority of MLA over competing prior approaches. Our code is available +at +https://github.com/Cecile-hi/Multimodal-Learning-with-Alternating-Unimodal-Adaptation.",cs.LG,"['cs.LG', 'cs.CV']" +Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition,Xiang Li · Jinglu Wang · Xiaohao Xu · Xiulian Peng · Rita Singh · Yan Lu · Bhiksha Raj, ,https://arxiv.org/abs/2310.00132,,2310.00132.pdf,QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition,"Audiovisual segmentation (AVS) is a challenging task that aims to segment +visual objects in videos according to their associated acoustic cues. With +multiple sound sources and background disturbances involved, establishing +robust correspondences between audio and visual contents poses unique +challenges due to (1) complex entanglement across sound sources and (2) +frequent changes in the occurrence of distinct sound events. Assuming sound +events occur independently, the multi-source semantic space can be represented +as the Cartesian product of single-source sub-spaces. We are motivated to +decompose the multi-source audio semantics into single-source semantics for +more effective interactions with visual content. We propose a semantic +decomposition method based on product quantization, where the multi-source +semantics can be decomposed and represented by several disentangled and +noise-suppressed single-source semantics. Furthermore, we introduce a +global-to-local quantization mechanism, which distills knowledge from stable +global (clip-level) features into local (frame-level) ones, to handle frequent +changes in audio semantics. Extensive experiments demonstrate that our +semantically decomposed audio representation significantly improves AVS +performance, e.g., +21.2% mIoU on the challenging AVS-Semantic benchmark with +ResNet50 backbone. https://github.com/lxa9867/QSD.",cs.CV,['cs.CV'] +MuRF: Multi-Baseline Radiance Fields,Haofei Xu · Anpei Chen · Yuedong Chen · Christos Sakaridis · Yulun Zhang · Marc Pollefeys · Andreas Geiger · Fisher Yu,https://haofeixu.github.io/murf/,https://arxiv.org/abs/2312.04565v1,,2312.04565v1.pdf,MuRF: Multi-Baseline Radiance Fields,"We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward +approach to solving sparse view synthesis under multiple different baseline +settings (small and large baselines, and different number of input views). To +render a target novel view, we discretize the 3D space into planes parallel to +the target image plane, and accordingly construct a target view frustum volume. +Such a target volume representation is spatially aligned with the target view, +which effectively aggregates relevant information from the input views for +high-quality rendering. It also facilitates subsequent radiance field +regression with a convolutional network thanks to its axis-aligned nature. The +3D context modeled by the convolutional network enables our method to synthesis +sharper scene structures than prior works. Our MuRF achieves state-of-the-art +performance across multiple different baseline settings and diverse scenarios +ranging from simple objects (DTU) to complex indoor and outdoor scenes +(RealEstate10K and LLFF). 
We also show promising zero-shot generalization +abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability +of MuRF.",cs.CV,['cs.CV'] +Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection,Taeheon Kim · Sebin Shin · Youngjoon Yu · Hak Gu Kim · Yong Man Ro, ,https://arxiv.org/abs/2403.01300,,2403.01300.pdf,Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection,"RGBT multispectral pedestrian detection has emerged as a promising solution +for safety-critical applications that require day/night operations. However, +the modality bias problem remains unsolved as multispectral pedestrian +detectors learn the statistical bias in datasets. Specifically, datasets in +multispectral pedestrian detection mainly distribute between ROTO (day) and +RXTO (night) data; the majority of the pedestrian labels statistically co-occur +with their thermal features. As a result, multispectral pedestrian detectors +show poor generalization ability on examples beyond this statistical +correlation, such as ROTX data. To address this problem, we propose a novel +Causal Mode Multiplexer (CMM) framework that effectively learns the causalities +between multispectral inputs and predictions. Moreover, we construct a new +dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian +detection. ROTX-MP mainly includes ROTX examples not presented in previous +datasets. Extensive experiments demonstrate that our proposed CMM framework +generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new +ROTX-MP. We will release our new dataset to the public for future research.",cs.CV,['cs.CV'] +GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo,Jiang Wu · Rui Li · Haofei Xu · Wenxun Zhao · Yu Zhu · Jinqiu Sun · Yanning Zhang, ,https://arxiv.org/abs/2404.07992v1,,2404.07992v1.pdf,GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo,"Matching cost aggregation plays a fundamental role in learning-based +multi-view stereo networks. However, directly aggregating adjacent costs can +lead to suboptimal results due to local geometric inconsistency. Related +methods either seek selective aggregation or improve aggregated depth in the 2D +space, both are unable to handle geometric inconsistency in the cost volume +effectively. In this paper, we propose GoMVS to aggregate geometrically +consistent costs, yielding better utilization of adjacent geometries. More +specifically, we correspond and propagate adjacent costs to the reference pixel +by leveraging the local geometric smoothness in conjunction with surface +normals. We achieve this by the geometric consistent propagation (GCP) module. +It computes the correspondence from the adjacent depth hypothesis space to the +reference depth space using surface normals, then uses the correspondence to +propagate adjacent costs to the reference geometry, followed by a convolution +for aggregation. Our method achieves new state-of-the-art performance on DTU, +Tanks & Temple, and ETH3D datasets. Notably, our method ranks 1st on the Tanks +& Temple Advanced benchmark.",cs.CV,['cs.CV'] +Test-Time Adaptation for Depth Completion,Hyoungseob Park · Anjali W Gupta · Alex Wong, ,https://arxiv.org/abs/2402.03312,,2402.03312.pdf,Test-Time Adaptation for Depth Completion,"It is common to observe performance degradation when transferring models +trained on some (source) datasets to target testing data due to a domain gap +between them. 
Existing methods for bridging this gap, such as domain adaptation +(DA), may require the source data on which the model was trained (often not +available), while others, i.e., source-free DA, require many passes through the +testing data. We propose an online test-time adaptation method for depth +completion, the task of inferring a dense depth map from a single image and +associated sparse depth map, that closes the performance gap in a single pass. +We first present a study on how the domain shift in each data modality affects +model performance. Based on our observations that the sparse depth modality +exhibits a much smaller covariate shift than the image, we design an embedding +module trained in the source domain that preserves a mapping from features +encoding only sparse depth to those encoding image and sparse depth. During +test time, sparse depth features are projected using this map as a proxy for +source domain features and are used as guidance to train a set of auxiliary +parameters (i.e., adaptation layer) to align image and sparse depth features +from the target test domain to that of the source domain. We evaluate our +method on indoor and outdoor scenarios and show that it improves over baselines +by an average of 21.1%.",cs.CV,"['cs.CV', 'cs.LG']" +FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning,Rishub Tamirisa · Chulin Xie · Wenxuan Bao · Andy Zhou · Ron Arel · Aviv Shamsian, ,https://arxiv.org/abs/2404.02478,,2404.02478.pdf,FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning,"Standard federated learning approaches suffer when client data distributions +have sufficient heterogeneity. Recent methods addressed the client data +heterogeneity issue via personalized federated learning (PFL) - a class of FL +algorithms aiming to personalize learned global knowledge to better suit the +clients' local data distributions. Existing PFL methods usually decouple global +updates in deep neural networks by performing personalization on particular +layers (i.e. classifier heads) and global aggregation for the rest of the +network. However, preselecting network layers for personalization may result in +suboptimal storage of global knowledge. In this work, we propose FedSelect, a +novel PFL algorithm inspired by the iterative subnetwork discovery procedure +used for the Lottery Ticket Hypothesis. FedSelect incrementally expands +subnetworks to personalize client parameters, concurrently conducting global +aggregations on the remaining parameters. This approach enables the +personalization of both client parameters and subnetwork structure during the +training process. Finally, we show that FedSelect outperforms recent +state-of-the-art PFL algorithms under challenging client data heterogeneity +settings and demonstrates robustness to various real-world distributional +shifts. Our code is available at https://github.com/lapisrocks/fedselect.",cs.LG,"['cs.LG', 'cs.AI']" +Mitigating Motion Blur in Neural Radiance Fields with Events and Frames,Marco Cannici · Davide Scaramuzza,https://github.com/uzh-rpg/EvDeblurNeRF,https://arxiv.org/abs/2403.19780,,2403.19780.pdf,Mitigating Motion Blur in Neural Radiance Fields with Events and Frames,"Neural Radiance Fields (NeRFs) have shown great potential in novel view +synthesis. However, they struggle to render sharp images when the data used for +training is affected by motion blur. 
On the other hand, event cameras excel in +dynamic scenes as they measure brightness changes with microsecond resolution +and are thus only marginally affected by blur. Recent methods attempt to +enhance NeRF reconstructions under camera motion by fusing frames and events. +However, they face challenges in recovering accurate color content or constrain +the NeRF to a set of predefined camera poses, harming reconstruction quality in +challenging conditions. This paper proposes a novel formulation addressing +these issues by leveraging both model- and learning-based modules. We +explicitly model the blur formation process, exploiting the event double +integral as an additional model-based prior. Additionally, we model the +event-pixel response using an end-to-end learnable response function, allowing +our method to adapt to non-idealities in the real event-camera sensor. We show, +on synthetic and real data, that the proposed approach outperforms existing +deblur NeRFs that use only frames as well as those that combine frames and +events by +6.13dB and +2.48dB, respectively.",cs.CV,['cs.CV'] +Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection,Jongha Kim · Jihwan Park · Jinyoung Park · Jinyoung Kim · Sehyung Kim · Hyunwoo J. Kim,https://github.com/mlvlab/speaq,https://arxiv.org/abs/2403.17709,,2403.17709.pdf,Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection,"Visual Relationship Detection (VRD) has seen significant advancements with +Transformer-based architectures recently. However, we identify two key +limitations in a conventional label assignment for training Transformer-based +VRD models, which is a process of mapping a ground-truth (GT) to a prediction. +Under the conventional assignment, an unspecialized query is trained since a +query is expected to detect every relation, which makes it difficult for a +query to specialize in specific relations. Furthermore, a query is also +insufficiently trained since a GT is assigned only to a single prediction, +therefore near-correct or even correct predictions are suppressed by being +assigned no relation as a GT. To address these issues, we propose Groupwise +Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise +Query Specialization trains a specialized query by dividing queries and +relations into disjoint groups and directing a query in a specific query group +solely toward relations in the corresponding relation group. Quality-Aware +Multi-Assignment further facilitates the training by assigning a GT to multiple +predictions that are significantly close to a GT in terms of a subject, an +object, and the relation in between. Experimental results and analyses show +that SpeaQ effectively trains specialized queries, which better utilize the +capacity of a model, resulting in consistent performance gains with zero +additional inference cost across multiple VRD models and benchmarks. Code is +available at https://github.com/mlvlab/SpeaQ.",cs.CV,['cs.CV'] +LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images,Jing Zhang · Irving Fang · Hao Wu · Akshat Kaushik · Alice Rodriguez · Hanwen Zhao · Juexiao Zhang · Zhuo Zheng · Radu Iovita · Chen Feng, ,https://arxiv.org/abs/2403.13171,,2403.13171.pdf,LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images,"Lithic Use-Wear Analysis (LUWA) using microscopic images is an underexplored +vision-for-science research area. 
It seeks to distinguish the worked material, +which is critical for understanding archaeological artifacts, material +interactions, tool functionalities, and dental records. However, this +challenging task goes beyond the well-studied image classification problem for +common objects. It is affected by many confounders owing to the complex wear +mechanism and microscopic imaging, which makes it difficult even for human +experts to identify the worked material successfully. In this paper, we +investigate the following three questions on this unique vision task for the +first time:(i) How well can state-of-the-art pre-trained models (like DINOv2) +generalize to the rarely seen domain? (ii) How can few-shot learning be +exploited for scarce microscopic images? (iii) How do the ambiguous +magnification and sensing modality influence the classification accuracy? To +study these, we collaborated with archaeologists and built the first +open-source and the largest LUWA dataset containing 23,130 microscopic images +with different magnifications and sensing modalities. Extensive experiments +show that existing pre-trained models notably outperform human experts but +still leave a large gap for improvements. Most importantly, the LUWA dataset +provides an underexplored opportunity for vision and learning communities and +complements existing image classification problems on common objects.",cs.CV,['cs.CV'] +Flow-Guided Online Stereo Rectification for Wide Baseline Stereo,Anush Kumar · Fahim Mannan · Omid Hosseini Jafari · Shile Li · Felix Heide,https://light.princeton.edu/online-stereo-recification/,https://arxiv.org/abs/2309.10314,,2309.10314.pdf,Dive Deeper into Rectifying Homography for Stereo Camera Online Self-Calibration,"Accurate estimation of stereo camera extrinsic parameters is the key to +guarantee the performance of stereo matching algorithms. In prior arts, the +online self-calibration of stereo cameras has commonly been formulated as a +specialized visual odometry problem, without taking into account the principles +of stereo rectification. In this paper, we first delve deeply into the concept +of rectifying homography, which serves as the cornerstone for the development +of our novel stereo camera online self-calibration algorithm, for cases where +only a single pair of images is available. Furthermore, we introduce a simple +yet effective solution for global optimum extrinsic parameter estimation in the +presence of stereo video sequences. Additionally, we emphasize the +impracticality of using three Euler angles and three components in the +translation vectors for performance quantification. Instead, we introduce four +new evaluation metrics to quantify the robustness and accuracy of extrinsic +parameter estimation, applicable to both single-pair and multi-pair cases. +Extensive experiments conducted across indoor and outdoor environments using +various experimental setups validate the effectiveness of our proposed +algorithm. The comprehensive evaluation results demonstrate its superior +performance in comparison to the baseline algorithm. 
Our source code, demo +video, and supplement are publicly available at mias.group/StereoCalibrator.",cs.RO,"['cs.RO', 'cs.CV']" +Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving,Brian Yang · Huangyuan Su · Nikolaos Gkanatsios · Tsung-Wei Ke · Ayush Jain · Jeff Schneider · Katerina Fragkiadaki, ,https://arxiv.org/abs/2402.06559,,2402.06559.pdf,Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following,"Diffusion models excel at modeling complex and multimodal trajectory +distributions for decision-making and control. Reward-gradient guided denoising +has been recently proposed to generate trajectories that maximize both a +differentiable reward function and the likelihood under the data distribution +captured by a diffusion model. Reward-gradient guided denoising requires a +differentiable reward function fitted to both clean and noised samples, +limiting its applicability as a general trajectory optimizer. In this paper, we +propose DiffusionES, a method that combines gradient-free optimization with +trajectory denoising to optimize black-box non-differentiable objectives while +staying in the data manifold. Diffusion-ES samples trajectories during +evolutionary search from a diffusion model and scores them using a black-box +reward function. It mutates high-scoring trajectories using a truncated +diffusion process that applies a small number of noising and denoising steps, +allowing for much more efficient exploration of the solution space. We show +that DiffusionES achieves state-of-the-art performance on nuPlan, an +established closed-loop planning benchmark for autonomous driving. Diffusion-ES +outperforms existing sampling-based planners, reactive deterministic or +diffusion-based policies, and reward-gradient guidance. Additionally, we show +that unlike prior guidance methods, our method can optimize non-differentiable +language-shaped reward functions generated by few-shot LLM prompting. When +guided by a human teacher that issues instructions to follow, our method can +generate novel, highly complex behaviors, such as aggressive lane weaving, +which are not present in the training data. This allows us to solve the hardest +nuPlan scenarios which are beyond the capabilities of existing trajectory +optimization methods and driving policies.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.RO']" +Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training,Yipeng Gao · Zeyu Wang · Wei-Shi Zheng · Cihang Xie · Yuyin Zhou,https://github.com/UCSC-VLAA/MixCon3D,https://arxiv.org/abs/2311.01734,,2311.01734.pdf,Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training,"Contrastive learning has emerged as a promising paradigm for 3D open-world +understanding, i.e., aligning point cloud representation to image and text +embedding space individually. In this paper, we introduce MixCon3D, a simple +yet effective method aiming to sculpt holistic 3D representation in contrastive +language-image-3D pre-training. In contrast to point cloud only, we develop the +3D object-level representation from complementary perspectives, e.g., +multi-view rendered images with the point cloud. Then, MixCon3D performs +language-3D contrastive learning, comprehensively depicting real-world 3D +objects and bolstering text alignment. 
Additionally, we pioneer the first +thorough investigation of various training recipes for the 3D contrastive +learning paradigm, building a solid baseline with improved performance. +Extensive experiments conducted on three representative benchmarks reveal that +our method significantly improves over the baseline, surpassing the previous +state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS +dataset by 5.7%. The versatility of MixCon3D is showcased in applications such +as text-to-3D retrieval and point cloud captioning, further evidencing its +efficacy in diverse scenarios. The code is available at +https://github.com/UCSC-VLAA/MixCon3D.",cs.CV,['cs.CV'] +Cross-spectral Gated-RGB Stereo Depth Estimation,Samuel Brucker · Stefanie Walz · Mario Bijelic · Felix Heide,https://light.princeton.edu/publication/gatedrccbstereo/,https://arxiv.org/abs/2405.12759,,2405.12759.pdf,Cross-spectral Gated-RGB Stereo Depth Estimation,"Gated cameras flood-illuminate a scene and capture the time-gated impulse +response of a scene. By employing nanosecond-scale gates, existing sensors are +capable of capturing mega-pixel gated images, delivering dense depth improving +on today's LiDAR sensors in spatial resolution and depth precision. Although +gated depth estimation methods deliver a million of depth estimates per frame, +their resolution is still an order below existing RGB imaging methods. In this +work, we combine high-resolution stereo HDR RCCB cameras with gated imaging, +allowing us to exploit depth cues from active gating, multi-view RGB and +multi-view NIR sensing -- multi-view and gated cues across the entire spectrum. +The resulting capture system consists only of low-cost CMOS sensors and +flood-illumination. We propose a novel stereo-depth estimation method that is +capable of exploiting these multi-modal multi-view depth cues, including the +active illumination that is measured by the RCCB camera when removing the +IR-cut filter. The proposed method achieves accurate depth at long ranges, +outperforming the next best existing method by 39% for ranges of 100 to 220m in +MAE on accumulated LiDAR ground-truth. Our code, models and datasets are +available at https://light.princeton.edu/gatedrccbstereo/ .",cs.CV,['cs.CV'] +$\mathcal{Z}^*$: Zero-shot $\underline{S}$tyle $\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$eweighting,Yingying Deng · Xiangyu He · Fan Tang · Weiming Dong, ,https://arxiv.org/abs/2311.16491,,2311.16491.pdf,$Z^*$: Zero-shot Style Transfer via Attention Rearrangement,"Despite the remarkable progress in image style transfer, formulating style in +the context of art is inherently subjective and challenging. In contrast to +existing learning/tuning methods, this study shows that vanilla diffusion +models can directly extract style information and seamlessly integrate the +generative prior into the content image without retraining. Specifically, we +adopt dual denoising paths to represent content/style references in latent +space and then guide the content image denoising process with style latent +codes. We further reveal that the cross-attention mechanism in latent diffusion +models tends to blend the content and style images, resulting in stylized +outputs that deviate from the original content image. To overcome this +limitation, we introduce a cross-attention rearrangement strategy. 
Through +theoretical analysis and experiments, we demonstrate the effectiveness and +superiority of the diffusion-based $\underline{Z}$ero-shot $\underline{S}$tyle +$\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$earrangement, +Z-STAR.",cs.CV,['cs.CV'] +CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images,Aaron Gokaslan · A. Feder Cooper · Jasmine Collins · Landan Seguin · Austin Jacobson · Mihir Patel · Jonathan Frankle · Cory Stephenson · Volodymyr Kuleshov, ,https://arxiv.org/abs/2310.16825,,2310.16825.pdf,CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images,"We assemble a dataset of Creative-Commons-licensed (CC) images, which we use +to train a set of open diffusion models that are qualitatively competitive with +Stable Diffusion 2 (SD2). This task presents two challenges: (1) +high-resolution CC images lack the captions necessary to train text-to-image +generative models; (2) CC images are relatively scarce. In turn, to address +these challenges, we use an intuitive transfer learning technique to produce a +set of high-quality synthetic captions paired with curated CC images. We then +develop a data- and compute-efficient training recipe that requires as little +as 3% of the LAION-2B data needed to train existing SD2 models, but obtains +comparable quality. These results indicate that we have a sufficient number of +CC images (~70 million) for training high-quality models. Our training recipe +also implements a variety of optimizations that achieve ~3X training speed-ups, +enabling rapid model iteration. We leverage this recipe to train several +high-quality text-to-image models, which we dub the CommonCanvas family. Our +largest model achieves comparable performance to SD2 on a human evaluation, +despite being trained on our CC dataset that is significantly smaller than +LAION and using synthetic captions for training. We release our models, data, +and code at +https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md",cs.CV,"['cs.CV', 'cs.CY']" +"HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild",Supreeth Narasimhaswamy · Huy Anh Nguyen · Lihan Huang · Minh Hoai, ,https://arxiv.org/abs/2404.13819,,2404.13819.pdf,"HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild","We address the challenging task of identifying, segmenting, and tracking +hand-held objects, which is crucial for applications such as human action +segmentation and performance evaluation. This task is particularly challenging +due to heavy occlusion, rapid motion, and the transitory nature of objects +being hand-held, where an object may be held, released, and subsequently picked +up again. To tackle these challenges, we have developed a novel +transformer-based architecture called HOIST-Former. HOIST-Former is adept at +spatially and temporally segmenting hands and objects by iteratively pooling +features from each other, ensuring that the processes of identification, +segmentation, and tracking of hand-held objects depend on the hands' positions +and their contextual appearance. We further refine HOIST-Former with a contact +loss that focuses on areas where hands are in contact with objects. Moreover, +we also contribute an in-the-wild video dataset called HOIST, which comprises +4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs +for hand-held objects. 
Through experiments on the HOIST dataset and two +additional public datasets, we demonstrate the efficacy of HOIST-Former in +segmenting and tracking hand-held objects.",cs.CV,['cs.CV'] +SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,Phillip Howard · Avinash Madasu · Tiep Le · Gustavo Lujan-Moreno · Anahita Bhiwandiwalla · Vasudev Lal, ,https://arxiv.org/abs/2312.00825,,2312.00825.pdf,SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,"While vision-language models (VLMs) have achieved remarkable performance +improvements recently, there is growing evidence that these models also posses +harmful biases with respect to social attributes such as gender and race. Prior +studies have primarily focused on probing such bias attributes individually +while ignoring biases associated with intersections between social attributes. +This could be due to the difficulty of collecting an exhaustive set of +image-text pairs for various combinations of social attributes. To address this +challenge, we employ text-to-image diffusion models to produce counterfactual +examples for probing intersectional social biases at scale. Our approach +utilizes Stable Diffusion with cross attention control to produce sets of +counterfactual image-text pairs that are highly similar in their depiction of a +subject (e.g., a given occupation) while differing only in their depiction of +intersectional social attributes (e.g., race & gender). Through our +over-generate-then-filter methodology, we produce SocialCounterfactuals, a +high-quality dataset containing 171k image-text pairs for probing +intersectional biases related to gender, race, and physical characteristics. We +conduct extensive experiments to demonstrate the usefulness of our generated +dataset for probing and mitigating intersectional social biases in +state-of-the-art VLMs.",cs.CV,"['cs.CV', 'cs.AI']" +Accurate Training Data for Occupancy Map Prediction in Automated Driving using Evidence Theory,Jonas Kälble · Sascha Wirges · Maxim Tatarchenko · Eddy Ilg, ,https://arxiv.org/abs/2405.10575,,2405.10575.pdf,Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory,"Automated driving fundamentally requires knowledge about the surrounding +geometry of the scene. Modern approaches use only captured images to predict +occupancy maps that represent the geometry. Training these approaches requires +accurate data that may be acquired with the help of LiDAR scanners. We show +that the techniques used for current benchmarks and training datasets to +convert LiDAR scans into occupancy grid maps yield very low quality, and +subsequently present a novel approach using evidence theory that yields more +accurate reconstructions. We demonstrate that these are superior by a large +margin, both qualitatively and quantitatively, and that we additionally obtain +meaningful uncertainty estimates. When converting the occupancy maps back to +depth estimates and comparing them with the raw LiDAR measurements, our method +yields a MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other +occupancy ground-truth data. 
Finally, we use the improved occupancy maps to +train a state-of-the-art occupancy prediction method and demonstrate that it +improves the MAE by 25% on nuScenes.",cs.CV,['cs.CV'] +ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models,Fei Kong · Jinhao Duan · Lichao Sun · Hao Cheng · Renjing Xu · Heng Tao Shen · Xiaofeng Zhu · Xiaoshuang Shi · Kaidi Xu, ,https://arxiv.org/abs/2311.14097,,2311.14097.pdf,ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models,"Though diffusion models excel in image generation, their step-by-step +denoising leads to slow generation speeds. Consistency training addresses this +issue with single-step sampling but often produces lower-quality generations +and requires high training costs. In this paper, we show that optimizing +consistency training loss minimizes the Wasserstein distance between target and +generated distributions. As timestep increases, the upper bound accumulates +previous consistency training losses. Therefore, larger batch sizes are needed +to reduce both current and accumulated losses. We propose Adversarial +Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) +divergence between distributions at each timestep using a discriminator. +Theoretically, ACT enhances generation quality, and convergence. By +incorporating a discriminator into the consistency training framework, our +method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64 and +LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting +capabilities, and uses less than $1/6$ of the original batch size and fewer +than $1/2$ of the model parameters and training steps compared to the baseline +method, this leads to a substantial reduction in resource consumption. Our code +is available:https://github.com/kong13661/ACT",cs.CV,['cs.CV'] +StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation,Sidi Wu · Yizi Chen · Loic Landrieu · Nicolas Gonthier · Samuel Mermet · Lorenz Hurni · Konrad Schindler, ,https://arxiv.org/abs/2403.20142,,2403.20142.pdf,StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation,"Most image-to-image translation models postulate that a unique correspondence +exists between the semantic classes of the source and target domains. However, +this assumption does not always hold in real-world scenarios due to divergent +distributions, different class sets, and asymmetrical information +representation. As conventional GANs attempt to generate images that match the +distribution of the target domain, they may hallucinate spurious instances of +classes absent from the source domain, thereby diminishing the usefulness and +reliability of translated images. CycleGAN-based methods are also known to hide +the mismatched information in the generated images to bypass cycle consistency +objectives, a process known as steganography. In response to the challenge of +non-bijective image translation, we introduce StegoGAN, a novel model that +leverages steganography to prevent spurious features in generated images. Our +approach enhances the semantic consistency of the translated images without +requiring additional postprocessing or supervision. Our experimental +evaluations demonstrate that StegoGAN outperforms existing GAN-based models +across various non-bijective image-to-image translation tasks, both +qualitatively and quantitatively. 
Our code and pretrained models are accessible +at https://github.com/sian-wusidi/StegoGAN.",cs.CV,"['cs.CV', 'eess.IV']" +LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset,Haolin Liu · Chongjie Ye · Yinyu Nie · Yingfan He · Xiaoguang Han, ,https://arxiv.org/html/2312.12418v1,,2312.12418v1.pdf,LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset,"Instance shape reconstruction from a 3D scene involves recovering the full +geometries of multiple objects at the semantic instance level. Many methods +leverage data-driven learning due to the intricacies of scene complexity and +significant indoor occlusions. Training these methods often requires a +large-scale, high-quality dataset with aligned and paired shape annotations +with real-world scans. Existing datasets are either synthetic or misaligned, +restricting the performance of data-driven methods on real data. To this end, +we introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising +10,412 high-quality CAD annotations aligned with 920 real-world scene scans +from ArkitScenes, created manually by professional artists. On this top, we +propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo) +method. It is empowered by a hybrid feature aggregation design to fuse +multi-modal inputs and recover high-fidelity object geometries. Besides, we +present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate +that our shape annotations provide scene occupancy clues that can further +improve 3D object detection. Supported by LASA, extensive experiments show that +our methods achieve state-of-the-art performance in both instance-level scene +reconstruction and 3D object detection tasks.",cs.CV,['cs.CV'] +Unsupervised Keypoints from Pretrained Diffusion Models,Eric Hedlin · Gopal Sharma · Shweta Mahajan · Xingzhe He · Hossam Isack · Abhishek Kar · Helge Rhodin · Andrea Tagliasacchi · Kwang Moo Yi, ,https://arxiv.org/abs/2312.00065,,2312.00065.pdf,Unsupervised Keypoints from Pretrained Diffusion Models,"Unsupervised learning of keypoints and landmarks has seen significant +progress with the help of modern neural network architectures, but performance +is yet to match the supervised counterpart, making their practicability +questionable. We leverage the emergent knowledge within text-to-image diffusion +models, towards more robust unsupervised keypoints. Our core idea is to find +text embeddings that would cause the generative model to consistently attend to +compact regions in images (i.e. keypoints). To do so, we simply optimize the +text embedding such that the cross-attention maps within the denoising network +are localized as Gaussians with small standard deviations. We validate our +performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD, +DeepFashion, and Human3.6m datasets. We achieve significantly improved +accuracy, sometimes even outperforming supervised ones, particularly for data +that is non-aligned and less curated. 
Our code is publicly available and can be +found through our project page: https://ubc-vision.github.io/StableKeypoints/",cs.CV,['cs.CV'] +READ: Retrieval-Enhanced Asymmetric Diffusion for Motion Planning,Takeru Oba · Matthew Walter · Norimichi Ukita,https://obat2343.github.io/READ.github.io/,http://export.arxiv.org/abs/2308.01557,,2308.01557.pdf,Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models,"Learning priors on trajectory distributions can help accelerate robot motion +planning optimization. Given previously successful plans, learning trajectory +generative models as priors for a new planning problem is highly desirable. +Prior works propose several ways on utilizing this prior to bootstrapping the +motion planning problem. Either sampling the prior for initializations or using +the prior distribution in a maximum-a-posterior formulation for trajectory +optimization. In this work, we propose learning diffusion models as priors. We +then can sample directly from the posterior trajectory distribution conditioned +on task goals, by leveraging the inverse denoising process of diffusion models. +Furthermore, diffusion has been recently shown to effectively encode data +multimodality in high-dimensional settings, which is particularly well-suited +for large trajectory dataset. To demonstrate our method efficacy, we compare +our proposed method - Motion Planning Diffusion - against several baselines in +simulated planar robot and 7-dof robot arm manipulator environments. To assess +the generalization capabilities of our method, we test it in environments with +previously unseen obstacles. Our experiments show that diffusion models are +strong priors to encode high-dimensional trajectory distributions of robot +motions.",cs.RO,"['cs.RO', 'cs.AI', 'cs.LG']" +On the Estimation of Image-matching Uncertainty in Visual Place Recognition,Mubariz Zaffar · Liangliang Nan · Julian F. P. Kooij, ,https://arxiv.org/abs/2404.00546,,2404.00546.pdf,On the Estimation of Image-matching Uncertainty in Visual Place Recognition,"In Visual Place Recognition (VPR) the pose of a query image is estimated by +comparing the image to a map of reference images with known reference poses. As +is typical for image retrieval problems, a feature extractor maps the query and +reference images to a feature space, where a nearest neighbor search is then +performed. However, till recently little attention has been given to +quantifying the confidence that a retrieved reference image is a correct match. +Highly certain but incorrect retrieval can lead to catastrophic failure of +VPR-based localization pipelines. This work compares for the first time the +main approaches for estimating the image-matching uncertainty, including the +traditional retrieval-based uncertainty estimation, more recent data-driven +aleatoric uncertainty estimation, and the compute-intensive geometric +verification. We further formulate a simple baseline method, ``SUE'', which +unlike the other methods considers the freely-available poses of the reference +images in the map. Our experiments reveal that a simple L2-distance between the +query and reference descriptors is already a better estimate of image-matching +uncertainty than current data-driven approaches. SUE outperforms the other +efficient uncertainty estimation methods, and its uncertainty estimates +complement the computationally expensive geometric verification approach. 
+Future works for uncertainty estimation in VPR should consider the baselines +discussed in this work.",cs.CV,['cs.CV'] +GROUNDHOG: Grounding Large Language Models to Holistic Segmentation,Yichi Zhang · Ziqiao Ma · Xiaofeng Gao · Suhaila Shakiah · Qiaozi Gao · Joyce Chai,https://groundhog-mllm.github.io/,https://arxiv.org/abs/2402.16846,,2402.16846.pdf,GROUNDHOG: Grounding Large Language Models to Holistic Segmentation,"Most multimodal large language models (MLLMs) learn language-to-object +grounding through causal language modeling where grounded objects are captured +by bounding boxes as sequences of location tokens. This paradigm lacks +pixel-level representations that are important for fine-grained visual +understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM +developed by grounding Large Language Models to holistic segmentation. +GROUNDHOG incorporates a masked feature extractor and converts extracted +features into visual entity tokens for the MLLM backbone, which then connects +groundable phrases to unified grounding masks by retrieving and merging the +entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual +instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by +harvesting a collection of segmentation-grounded datasets with rich +annotations. Our experimental results show that GROUNDHOG achieves superior +performance on various language grounding tasks without task-specific +fine-tuning, and significantly reduces object hallucination. GROUNDHOG also +demonstrates better grounding towards complex forms of visual input and +provides easy-to-understand diagnosis in failure cases.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +"Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model",Shraman Pramanick · Guangxing Han · Rui Hou · Sayan Nag · Ser-Nam Lim · Nicolas Ballas · Qifan Wang · Rama Chellappa · Amjad Almahairi, ,https://arxiv.org/abs/2312.12423,,2312.12423.pdf,"Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model","The ability of large language models (LLMs) to process visual inputs has +given rise to general-purpose vision systems, unifying various vision-language +(VL) tasks by instruction tuning. However, due to the enormous diversity in +input-output formats in the vision domain, existing general-purpose models fail +to successfully integrate segmentation and multi-image inputs with coarse-level +tasks into a single framework. In this work, we introduce VistaLLM, a powerful +visual system that addresses coarse- and fine-grained VL tasks over single and +multiple input images using a unified framework. VistaLLM utilizes an +instruction-guided image tokenizer that filters global embeddings using task +descriptions to extract compressed and refined features from numerous images. +Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to +represent binary segmentation masks as sequences, significantly improving over +previously used uniform sampling. To bolster the desired capability of +VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning +dataset with 6.8M samples. We also address the lack of multi-image grounding +datasets by introducing a novel task, AttCoSeg (Attribute-level +Co-Segmentation), which boosts the model's reasoning and grounding capability +over multiple input images. 
Extensive experiments on a wide range of V- and VL +tasks demonstrate the effectiveness of VistaLLM by achieving consistent +state-of-the-art performance over strong baselines across all downstream tasks. +Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.",cs.CV,"['cs.CV', 'cs.AI']" +Spectrum AUC Difference (SAUCD): Human Aligned 3D Shape Evaluation,Tianyu Luan · Zhong Li · Lele Chen · Xuan Gong · Lichang Chen · Yi Xu · Junsong Yuan, ,https://arxiv.org/abs/2403.01619,,2403.01619.pdf,Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation,"Existing 3D mesh shape evaluation metrics mainly focus on the overall shape +but are usually less sensitive to local details. This makes them inconsistent +with human evaluation, as human perception cares about both overall and +detailed shape. In this paper, we propose an analytic metric named Spectrum +Area Under the Curve Difference (SAUCD) that demonstrates better consistency +with human evaluation. To compare the difference between two shapes, we first +transform the 3D mesh to the spectrum domain using the discrete +Laplace-Beltrami operator and Fourier transform. Then, we calculate the Area +Under the Curve (AUC) difference between the two spectrums, so that each +frequency band that captures either the overall or detailed shape is equitably +considered. Taking human sensitivity across frequency bands into account, we +further extend our metric by learning suitable weights for each frequency band +which better aligns with human perception. To measure the performance of SAUCD, +we build a 3D mesh evaluation dataset called Shape Grading, along with manual +annotations from more than 800 subjects. By measuring the correlation between +our metric and human evaluation, we demonstrate that SAUCD is well aligned with +human evaluation, and outperforms previous 3D mesh metrics.",cs.CV,"['cs.CV', 'cs.GR']" +AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation,Haonan Wang · Qixiang ZHANG · Yi Li · Xiaomeng Li, ,https://arxiv.org/abs/2403.01818,,2403.01818.pdf,AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation,"Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate +the burden of time-consuming pixel-level manual labeling, which leverages +limited labeled data along with larger amounts of unlabeled data. Current +state-of-the-art methods train the labeled data with ground truths and +unlabeled data with pseudo labels. However, the two training flows are +separate, which allows labeled data to dominate the training process, resulting +in low-quality pseudo labels and, consequently, sub-optimal results. To +alleviate this issue, we present AllSpark, which reborns the labeled features +from unlabeled ones with the channel-wise cross-attention mechanism. We further +introduce a Semantic Memory along with a Channel Semantic Grouping strategy to +ensure that unlabeled features adequately represent labeled features. The +AllSpark shed new light on the architecture level designs of SSSS rather than +framework level, which avoids increasingly complicated training pipeline +designs. It can also be regarded as a flexible bottleneck module that can be +seamlessly integrated into a general transformer-based segmentation model. The +proposed AllSpark outperforms existing methods across all evaluation protocols +on Pascal, Cityscapes and COCO benchmarks without bells-and-whistles. 
Code and +model weights are available at: https://github.com/xmed-lab/AllSpark.",cs.CV,"['cs.CV', 'cs.AI']" +Real-Time Simulated Avatar from Head-Mounted Sensors,Zhengyi Luo · Jinkun Cao · Rawal Khirodkar · Alexander Winkler · Jing Huang · Kris Kitani · Weipeng Xu, ,https://arxiv.org/abs/2403.06862,,2403.06862.pdf,Real-Time Simulated Avatar from Head-Mounted Sensors,"We present SimXR, a method for controlling a simulated avatar from +information (headset pose and cameras) obtained from AR / VR headsets. Due to +the challenging viewpoint of head-mounted cameras, the human body is often +clipped out of view, making traditional image-based egocentric pose estimation +challenging. On the other hand, headset poses provide valuable information +about overall body motion, but lack fine-grained details about the hands and +feet. To synergize headset poses with cameras, we control a humanoid to track +headset movement while analyzing input images to decide body movement. When +body parts are seen, the movements of hands and feet will be guided by the +images; when unseen, the laws of physics guide the controller to generate +plausible motion. We design an end-to-end method that does not rely on any +intermediate representations and learns to directly map from images and headset +poses to humanoid control signals. To train our method, we also propose a +large-scale synthetic dataset created using camera configurations compatible +with a commercially available VR headset (Quest 2) and show promising results +on real-world captures. To demonstrate the applicability of our framework, we +also test it on an AR headset with a forward-facing camera.",cs.CV,"['cs.CV', 'cs.GR', 'cs.RO']" +Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving,Mozhgan Pourkeshavarz · Mohammad Sabokrou · Amir Rasouli, ,https://arxiv.org/abs/2306.15755,,2306.15755.pdf,Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving,"In autonomous driving, behavior prediction is fundamental for safe motion +planning, hence the security and robustness of prediction models against +adversarial attacks are of paramount importance. We propose a novel adversarial +backdoor attack against trajectory prediction models as a means of studying +their potential vulnerabilities. Our attack affects the victim at training time +via naturalistic, hence stealthy, poisoned samples crafted using a novel +two-step approach. First, the triggers are crafted by perturbing the trajectory +of attacking vehicle and then disguised by transforming the scene using a +bi-level optimization technique. The proposed attack does not depend on a +particular model architecture and operates in a black-box manner, thus can be +effective without any knowledge of the victim model. We conduct extensive +empirical studies using state-of-the-art prediction models on two benchmark +datasets using metrics customized for trajectory prediction. We show that the +proposed attack is highly effective, as it can significantly hinder the +performance of prediction models, unnoticeable by the victims, and efficient as +it forces the victim to generate malicious behavior even under constrained +conditions. 
Via ablative studies, we analyze the impact of different attack +design choices followed by an evaluation of existing defence mechanisms against +the proposed attack.",cs.CV,['cs.CV'] +MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling,Xuzhe Zhang · Yuhao Wu · Elsa Angelini · Ang Li · Jia Guo · Jerod Rasmussen · Thomas O'Connor · Pathik Wadhwa · Andrea Jackowski · Hai Li · Jonathan Posner · Andrew Laine · YUN WANG · Yun Wang,https://github.com/XuzheZ/MAPSeg,,https://www.researchgate.net/publication/378738417_MAPSeg_Unified_Unsupervised_Domain_Adaptation_for_Heterogeneous_Medical_Image_Segmentation_Based_on_3D_Masked_Autoencoding_and_Pseudo-Labeling,,,,,nan +KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling,Yu Wang · Xin Li · Shengzhao Wen · gang zhang · Haixiao Yue · Haocheng Feng · Junyu Han · Errui Ding, ,https://arxiv.org/abs/2311.13657,,2311.13657.pdf,Efficient Transformer Knowledge Distillation: A Performance Review,"As pretrained transformer language models continue to achieve +state-of-the-art performance, the Natural Language Processing community has +pushed for advances in model compression and efficient attention mechanisms to +address high computational requirements and limited input sequence length. +Despite these separate efforts, no investigation has been done into the +intersection of these two fields. In this work, we provide an evaluation of +model compression via knowledge distillation on efficient attention +transformers. We provide cost-performance trade-offs for the compression of +state-of-the-art efficient attention architectures and the gains made in +performance in comparison to their full attention counterparts. Furthermore, we +introduce a new long-context Named Entity Recognition dataset, GONERD, to train +and test the performance of NER models on long sequences. We find that +distilled efficient attention transformers can preserve a significant amount of +original model performance, preserving up to 98.6% across short-context tasks +(GLUE, SQUAD, CoNLL-2003), up to 94.6% across long-context +Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on +long-context Named Entity Recognition (GONERD), while decreasing inference +times by up to 57.8%. We find that, for most models on most tasks, performing +knowledge distillation is an effective method to yield high-performing +efficient attention models with low costs.",cs.CL,"['cs.CL', 'cs.LG']" +Point-VOS: Pointing Up Video Object Segmentation,Sabarinath Mahadevan · Idil Esen Zulfikar · Paul Voigtlaender · Bastian Leibe, ,https://arxiv.org/abs/2402.05917v1,,2402.05917v1.pdf,Point-VOS: Pointing Up Video Object Segmentation,"Current state-of-the-art Video Object Segmentation (VOS) methods rely on +dense per-object mask annotations both during training and testing. This +requires time-consuming and costly video annotation mechanisms. We propose a +novel Point-VOS task with a spatio-temporally sparse point-wise annotation +scheme that substantially reduces the annotation effort. We apply our +annotation scheme to two large-scale video datasets with text descriptions and +annotate over 19M points across 133K objects in 32K videos. Based on our +annotations, we propose a new Point-VOS benchmark, and a corresponding +point-based training mechanism, which we use to establish strong baseline +results. 
We show that existing VOS methods can easily be adapted to leverage +our point annotations during training, and can achieve results close to the +fully-supervised performance when trained on pseudo-masks generated from these +points. In addition, we show that our data can be used to improve models that +connect vision and language, by evaluating it on the Video Narrative Grounding +(VNG) task. We will make our code and annotations available at +https://pointvos.github.io.",cs.CV,['cs.CV'] +DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing,Yujun Shi · Chuhui Xue · Jun Hao Liew · Jiachun Pan · Hanshu Yan · Wenqing Zhang · Vincent Y. F. Tan · Song Bai, ,https://arxiv.org/abs/2306.14435,,2306.14435.pdf,DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing,"Accurate and controllable image editing is a challenging task that has +attracted significant attention recently. Notably, DragGAN is an interactive +point-based image editing framework that achieves impressive editing results +with pixel-level precision. However, due to its reliance on generative +adversarial networks (GANs), its generality is limited by the capacity of +pretrained GAN models. In this work, we extend this editing framework to +diffusion models and propose a novel approach DragDiffusion. By harnessing +large-scale pretrained diffusion models, we greatly enhance the applicability +of interactive point-based editing on both real and diffusion-generated images. +Our approach involves optimizing the diffusion latents to achieve precise +spatial control. The supervision signal of this optimization process is from +the diffusion model's UNet features, which are known to contain rich semantic +and geometric information. Moreover, we introduce two additional techniques, +namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity +of the original image. Lastly, we present a challenging benchmark dataset +called DragBench -- the first benchmark to evaluate the performance of +interactive point-based image editing methods. Experiments across a wide range +of challenging cases (e.g., images with multiple objects, diverse object +categories, various styles, etc.) demonstrate the versatility and generality of +DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +Revisiting Adversarial Training at Scale,Zeyu Wang · Xianhang li · Hongru Zhu · Cihang Xie, ,https://arxiv.org/abs/2401.04727,,2401.04727.pdf,Revisiting Adversarial Training at Scale,"The machine learning community has witnessed a drastic change in the training +pipeline, pivoted by those ''foundation models'' with unprecedented scales. +However, the field of adversarial training is lagging behind, predominantly +centered around small model sizes like ResNet-50, and tiny and low-resolution +datasets like CIFAR-10. To bridge this transformation gap, this paper provides +a modern re-examination with adversarial training, investigating its potential +benefits when applied at scale. Additionally, we introduce an efficient and +effective training strategy to enable adversarial training with giant models +and web-scale data at an affordable computing cost. We denote this newly +introduced framework as AdvXL. + Empirical results demonstrate that AdvXL establishes new state-of-the-art +robust accuracy records under AutoAttack on ImageNet-1K. 
For example, by +training on DataComp-1B dataset, our AdvXL empowers a vanilla ViT-g model to +substantially surpass the previous records of $l_{\infty}$-, $l_{2}$-, and +$l_{1}$-robust accuracy by margins of 11.4%, 14.2% and 12.9%, respectively. +This achievement posits AdvXL as a pioneering approach, charting a new +trajectory for the efficient training of robust visual representations at +significantly larger scales. Our code is available at +https://github.com/UCSC-VLAA/AdvXL.",cs.CV,['cs.CV'] +Seeing Motion at Nighttime with an Event Camera,Haoyue Liu · Shihan Peng · Lin Zhu · Yi Chang · Hanyu Zhou · Luxin Yan,https://github.com/Liu-haoyue/NER-Net,https://arxiv.org/abs/2404.11884,,2404.11884.pdf,Seeing Motion at Nighttime with an Event Camera,"We focus on a very challenging task: imaging at nighttime dynamic scenes. +Most previous methods rely on the low-light enhancement of a conventional RGB +camera. However, they would inevitably face a dilemma between the long exposure +time of nighttime and the motion blur of dynamic scenes. Event cameras react to +dynamic changes with higher temporal resolution (microsecond) and higher +dynamic range (120dB), offering an alternative solution. In this work, we +present a novel nighttime dynamic imaging method with an event camera. +Specifically, we discover that the event at nighttime exhibits temporal +trailing characteristics and spatial non-stationary distribution. Consequently, +we propose a nighttime event reconstruction network (NER-Net) which mainly +includes a learnable event timestamps calibration module (LETC) to align the +temporal trailing events and a non-uniform illumination aware module (NIAM) to +stabilize the spatiotemporal distribution of events. Moreover, we construct a +paired real low-light event dataset (RLED) through a co-axial imaging system, +including 64,200 spatially and temporally aligned image GTs and low-light +events. Extensive experiments demonstrate that the proposed method outperforms +state-of-the-art methods in terms of visual quality and generalization ability +on real-world nighttime datasets. The project are available at: +https://github.com/Liu-haoyue/NER-Net.",cs.CV,['cs.CV'] +Generative Unlearning for Any Identity,Juwon Seo · Sung-Hoon Lee · Tae-Young Lee · SeungJun Moon · Gyeong-Moon Park, ,https://arxiv.org/abs/2405.09879,,2405.09879.pdf,Generative Unlearning for Any Identity,"Recent advances in generative models trained on large-scale datasets have +made it possible to synthesize high-quality samples across various domains. +Moreover, the emergence of strong inversion networks enables not only a +reconstruction of real-world images but also the modification of attributes +through various editing methods. However, in certain domains related to privacy +issues, e.g., human faces, advanced generative models along with strong +inversion methods can lead to potential misuses. In this paper, we propose an +essential yet under-explored task called generative identity unlearning, which +steers the model not to generate an image of a specific identity. In the +generative identity unlearning, we target the following objectives: (i) +preventing the generation of images with a certain identity, and (ii) +preserving the overall quality of the generative model. To satisfy these goals, +we propose a novel framework, Generative Unlearning for Any Identity (GUIDE), +which prevents the reconstruction of a specific identity by unlearning the +generator with only a single image. 
GUIDE consists of two parts: (i) finding a +target point for optimization that un-identifies the source latent code and +(ii) novel loss functions that facilitate the unlearning procedure while less +affecting the learned distribution. Our extensive experiments demonstrate that +our proposed method achieves state-of-the-art performance in the generative +machine unlearning task. The code is available at +https://github.com/KHU-AGI/GUIDE.",cs.CV,"['cs.CV', 'cs.AI']" +OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,Yutao Hu · Yutao Hu · Tianbin · Quanfeng Lu · Wenqi Shao · Junjun He · Yu Qiao · Ping Luo, ,https://arxiv.org/abs/2402.09181,,2402.09181.pdf,OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,"Large Vision-Language Models (LVLMs) have demonstrated remarkable +capabilities in various multimodal tasks. However, their potential in the +medical domain remains largely unexplored. A significant challenge arises from +the scarcity of diverse medical images spanning various modalities and +anatomical regions, which is essential in real-world medical applications. To +solve this problem, in this paper, we introduce OmniMedVQA, a novel +comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark +is collected from 73 different medical datasets, including 12 different +modalities and covering more than 20 distinct anatomical regions. Importantly, +all images in this benchmark are sourced from authentic medical scenarios, +ensuring alignment with the requirements of the medical field and suitability +for evaluating LVLMs. Through our extensive experiments, we have found that +existing LVLMs struggle to address these medical VQA problems effectively. +Moreover, what surprises us is that medical-specialized LVLMs even exhibit +inferior performance to those general-domain models, calling for a more +versatile and robust LVLM in the biomedical field. The evaluation results not +only reveal the current limitations of LVLM in understanding real medical +images but also highlight our dataset's significance. Our code with dataset are +available at https://github.com/OpenGVLab/Multi-Modality-Arena.",eess.IV,"['eess.IV', 'cs.CV']" +Sequential Modeling Enables Scalable Learning for Large Vision Models,Yutong Bai · Xinyang Geng · Xinyang Geng · Karttikeya Mangalam · Amir Bar · Alan L. Yuille · Trevor Darrell · Jitendra Malik · Alexei A. Efros, ,https://arxiv.org/abs/2312.00785,,2312.00785.pdf,Sequential Modeling Enables Scalable Learning for Large Vision Models,"We introduce a novel sequential modeling approach which enables learning a +Large Vision Model (LVM) without making use of any linguistic data. To do this, +we define a common format, ""visual sentences"", in which we can represent raw +images and videos as well as annotated data sources such as semantic +segmentations and depth reconstructions without needing any meta-knowledge +beyond the pixels. Once this wide variety of visual data (comprising 420 +billion tokens) is represented as sequences, the model can be trained to +minimize a cross-entropy loss for next token prediction. By training across +various scales of model architecture and data diversity, we provide empirical +evidence that our models scale effectively. 
Many different vision tasks can be +solved by designing suitable visual prompts at test time.",cs.CV,['cs.CV'] +An edit friendly ddpm noise space: inversion and manipulations,Inbar Huberman-Spiegelglas · Vladimir Kulikov · Tomer Michaeli, ,https://ar5iv.labs.arxiv.org/html/2307.00522,,2307.00522.pdf,LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance,"Recent large-scale text-guided diffusion models provide powerful +image-generation capabilities. Currently, a significant effort is given to +enable the modification of these images using text only as means to offer +intuitive and versatile editing. However, editing proves to be difficult for +these generative models due to the inherent nature of editing techniques, which +involves preserving certain content from the original image. Conversely, in +text-based models, even minor modifications to the text prompt frequently +result in an entirely distinct result, making attaining one-shot generation +that accurately corresponds to the users intent exceedingly challenging. In +addition, to edit a real image using these state-of-the-art tools, one must +first invert the image into the pre-trained models domain - adding another +factor affecting the edit quality, as well as latency. In this exploratory +report, we propose LEDITS - a combined lightweight approach for real-image +editing, incorporating the Edit Friendly DDPM inversion technique with Semantic +Guidance, thus extending Semantic Guidance to real image editing, while +harnessing the editing capabilities of DDPM inversion as well. This approach +achieves versatile edits, both subtle and extensive as well as alterations in +composition and style, while requiring no optimization nor extensions to the +architecture.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes,Alexandros Delitzas · Ayça Takmaz · Federico Tombari · Robert Sumner · Marc Pollefeys · Francis Engelmann,https://scenefun3d.github.io,https://arxiv.org/html/2404.03650v1,,2404.03650v1.pdf,OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views,"Large visual-language models (VLMs), like CLIP, enable open-set image +segmentation to segment arbitrary concepts from an image in a zero-shot manner. +This goes beyond the traditional closed-set assumption, i.e., where models can +only segment classes from a pre-defined training set. More recently, first +works on open-set segmentation in 3D scenes have appeared in the literature. +These methods are heavily influenced by closed-set 3D convolutional approaches +that process point clouds or polygon meshes. However, these 3D scene +representations do not align well with the image-based nature of the +visual-language models. Indeed, point cloud and 3D meshes typically have a +lower resolution than images and the reconstructed 3D scene geometry might not +project well to the underlying 2D image sequences used to compute pixel-aligned +CLIP features. To address these challenges, we propose OpenNeRF which naturally +operates on posed images and directly encodes the VLM features within the NeRF. +This is similar in spirit to LERF, however our work shows that using pixel-wise +VLM features (instead of global CLIP features) results in an overall less +complex architecture without the need for additional DINO regularization. 
Our +OpenNeRF further leverages NeRF's ability to render novel views and extract +open-set VLM features from areas that are not well observed in the initial +posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF +outperforms recent open-vocabulary methods such as LERF and OpenScene by at +least +4.9 mIoU.",cs.CV,['cs.CV'] +Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields,Shijie Zhou · Haoran Chang · Sicheng Jiang · Zhiwen Fan · Zehao Zhu · Dejia Xu · Dejia Xu · Pradyumna Chari · Suya You · Zhangyang Wang · Achuta Kadambi, ,https://arxiv.org/abs/2312.03203,,2312.03203.pdf,Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields,"3D scene representations have gained immense popularity in recent years. +Methods that use Neural Radiance fields are versatile for traditional tasks +such as novel view synthesis. In recent times, some work has emerged that aims +to extend the functionality of NeRF beyond view synthesis, for semantically +aware tasks such as editing and segmentation using 3D feature field +distillation from 2D foundation models. However, these methods have two major +limitations: (a) they are limited by the rendering speed of NeRF pipelines, and +(b) implicitly represented feature fields suffer from continuity artifacts +reducing feature quality. Recently, 3D Gaussian Splatting has shown +state-of-the-art performance on real-time radiance field rendering. In this +work, we go one step further: in addition to radiance field rendering, we +enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D +foundation model distillation. This translation is not straightforward: naively +incorporating feature fields in the 3DGS framework encounters significant +challenges, notably the disparities in spatial resolution and channel +consistency between RGB images and feature maps. We propose architectural and +training changes to efficiently avert this problem. Our proposed method is +general, and our experiments showcase novel view semantic segmentation, +language-guided editing and segment anything through learning feature fields +from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across +experiments, our distillation method is able to provide comparable or better +results, while being significantly faster to both train and render. +Additionally, to the best of our knowledge, we are the first method to enable +point and bounding-box prompting for radiance field manipulation, by leveraging +the SAM model. Project website at: https://feature-3dgs.github.io/",cs.CV,['cs.CV'] +Taming Mode Collapse in Score Distillation for Text-to-3D Generation,Peihao Wang · Dejia Xu · Dejia Xu · Zhiwen Fan · Dilin Wang · Sreyas Mohan · Forrest Iandola · Rakesh Ranjan · Yilei Li · Qiang Liu · Zhangyang Wang · Vikas Chandra, ,https://arxiv.org/abs/2401.00909,,2401.00909.pdf,Taming Mode Collapse in Score Distillation for Text-to-3D Generation,"Despite the remarkable performance of score distillation in text-to-3D +generation, such techniques notoriously suffer from view inconsistency issues, +also known as ""Janus"" artifact, where the generated objects fake each view with +multiple front faces. Although empirically effective methods have approached +this problem via score debiasing or prompt engineering, a more rigorous +perspective to explain and tackle this problem remains elusive. 
In this paper, +we reveal that the existing score distillation-based text-to-3D generation +frameworks degenerate to maximal likelihood seeking on each view independently +and thus suffer from the mode collapse problem, manifesting as the Janus +artifact in practice. To tame mode collapse, we improve score distillation by +re-establishing the entropy term in the corresponding variational objective, +which is applied to the distribution of rendered images. Maximizing the entropy +encourages diversity among different views in generated 3D assets, thereby +mitigating the Janus problem. Based on this new objective, we derive a new +update rule for 3D score distillation, dubbed Entropic Score Distillation +(ESD). We theoretically reveal that ESD can be simplified and implemented by +just adopting the classifier-free guidance trick upon variational score +distillation. Although embarrassingly straightforward, our extensive +experiments successfully demonstrate that ESD can be an effective treatment for +Janus artifacts in score distillation.",cs.CV,"['cs.CV', 'cs.LG']" +LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction,Linqing Zhao · Xiuwei Xu · Ziwei Wang · Yunpeng Zhang · Borui Zhang · Wenzhao Zheng · Dalong Du · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2405.17429,,2405.17429.pdf,GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction,"3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and +semantics of the surrounding scene and is an important task for the robustness +of vision-centric autonomous driving. Most existing methods employ dense grids +such as voxels as scene representations, which ignore the sparsity of occupancy +and the diversity of object scales and thus lead to unbalanced allocation of +resources. To address this, we propose an object-centric representation to +describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian +represents a flexible region of interest and its semantic features. We +aggregate information from images through the attention mechanism and +iteratively refine the properties of 3D Gaussians including position, +covariance, and semantics. We then propose an efficient Gaussian-to-voxel +splatting method to generate 3D occupancy predictions, which only aggregates +the neighboring Gaussians for a certain position. We conduct extensive +experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental +results demonstrate that GaussianFormer achieves comparable performance with +state-of-the-art methods with only 17.8% - 24.8% of their memory consumption. +Code is available at: https://github.com/huang-yh/GaussianFormer.",cs.CV,"['cs.CV', 'cs.AI']" +mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,Qinghao Ye · Haiyang Xu · Jiabo Ye · Ming Yan · Anwen Hu · Haowei Liu · Qi Qian · Ji Zhang · Fei Huang · Fei Huang, ,https://arxiv.org/abs/2311.04257,,2311.04257.pdf,mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,"Multi-modal Large Language Models (MLLMs) have demonstrated impressive +instruction abilities across various open-ended tasks. However, previous +methods primarily focus on enhancing multi-modal capabilities. In this work, we +introduce a versatile multi-modal large language model, mPLUG-Owl2, which +effectively leverages modality collaboration to improve performance in both +text and multi-modal tasks. 
mPLUG-Owl2 utilizes a modularized network design, +with the language decoder acting as a universal interface for managing +different modalities. Specifically, mPLUG-Owl2 incorporates shared functional +modules to facilitate modality collaboration and introduces a modality-adaptive +module that preserves modality-specific features. Extensive experiments reveal +that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal +tasks and achieving state-of-the-art performances with a single generic model. +Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality +collaboration phenomenon in both pure-text and multi-modal scenarios, setting a +pioneering path in the development of future multi-modal foundation models.",cs.CL,"['cs.CL', 'cs.CV']" +NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models,Yusuf Dalva · Pinar Yanardag, ,https://arxiv.org/abs/2312.05390,,2312.05390.pdf,NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models,"Generative models have been very popular in the recent years for their image +generation capabilities. GAN-based models are highly regarded for their +disentangled latent space, which is a key feature contributing to their success +in controlled image editing. On the other hand, diffusion models have emerged +as powerful tools for generating high-quality images. However, the latent space +of diffusion models is not as thoroughly explored or understood. Existing +methods that aim to explore the latent space of diffusion models usually relies +on text prompts to pinpoint specific semantics. However, this approach may be +restrictive in areas such as art, fashion, or specialized fields like medicine, +where suitable text prompts might not be available or easy to conceive thus +limiting the scope of existing work. In this paper, we propose an unsupervised +method to discover latent semantics in text-to-image diffusion models without +relying on text prompts. Our method takes a small set of unlabeled images from +specific domains, such as faces or cats, and a pre-trained diffusion model, and +discovers diverse semantics in unsupervised fashion using a contrastive +learning objective. Moreover, the learned directions can be applied +simultaneously, either within the same domain (such as various types of facial +edits) or across different domains (such as applying cat and face edits within +the same image) without interfering with each other. Our extensive experiments +show that our method achieves highly disentangled edits, outperforming existing +approaches in both diffusion-based and GAN-based latent space editing methods.",cs.CV,['cs.CV'] +On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving,Kaituo Feng · Changsheng Li · Dongchun Ren · Ye Yuan · Guoren Wang, ,https://arxiv.org/abs/2403.01238,,2403.01238.pdf,On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving,"End-to-end motion planning models equipped with deep neural networks have +shown great potential for enabling full autonomous driving. However, the +oversized neural networks render them impractical for deployment on +resource-constrained systems, which unavoidably requires more computational +time and resources during reference.To handle this, knowledge distillation +offers a promising approach that compresses models by enabling a smaller +student model to learn from a larger teacher model. 
Nevertheless, how to apply +knowledge distillation to compress motion planners has not been explored so +far. In this paper, we propose PlanKD, the first knowledge distillation +framework tailored for compressing end-to-end motion planners. First, +considering that driving scenes are inherently complex, often containing +planning-irrelevant or even noisy information, transferring such information is +not beneficial for the student planner. Thus, we design an information +bottleneck based strategy to only distill planning-relevant information, rather +than transfer all information indiscriminately. Second, different waypoints in +an output planned trajectory may hold varying degrees of importance for motion +planning, where a slight deviation in certain crucial waypoints might lead to a +collision. Therefore, we devise a safety-aware waypoint-attentive distillation +module that assigns adaptive weights to different waypoints based on the +importance, to encourage the student to accurately mimic more crucial +waypoints, thereby improving overall safety. Experiments demonstrate that our +PlanKD can boost the performance of smaller planners by a large margin, and +significantly reduce their reference time.",cs.CV,['cs.CV'] +Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning,Zichen Miao · Jiang Wang · Ze Wang · Zhengyuan Yang · Lijuan Wang · Qiang Qiu · Zicheng Liu, ,,https://bair.berkeley.edu/blog/2023/07/14/ddpo/,,,,,nan +HDQMF: Holographic Feature Decomposition Using Quantum Algorithms,Prathyush Poduval · Zhuowen Zou · Mohsen Imani, ,https://arxiv.org/abs/2403.17444,,,Quantum accelerated cross regression algorithm for multiview feature extraction,"Multi-view Feature Extraction (MvFE) has wide applications in machine +learning, image processing and other fields. When dealing with massive +high-dimensional data, the performance of classical computer faces severe +challenges due to MvFE involves expensive matrix calculation. To address this +challenge, a quantum-accelerated cross-regression algorithm for MvFE is +proposed. The main contributions are as follows:(1) a quantum version algorithm +for MvFE is proposed for the first time, filling the gap of quantum computing +in the field of MvFE;(2) a quantum algorithm is designed to construct the +block-encoding of the target data matrix, so that the optimal Hamiltonian +simulation technology based on the block-encoding framework can be used to +efficiently realize the quantum simulation of the target data matrix. This +approach reduces the dependence of the algorithm's on simulation errors to +enhance algorithm performance;(3) compared with the classical counterpart +algorithm, the proposed quantum algorithm has a polynomial acceleration in the +number of data points, the dimension of data points and the number of view +data.",quant-ph,['quant-ph'] +Leveraging Predicate and Triplet Learning for Scene Graph Generation,Jiankai Li · Yunhong Wang · Xiefan Guo · Ruijie Yang · Weixin Li, ,https://arxiv.org/abs/2309.03542,,2309.03542.pdf,Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction,"Scene Graph Generation (SGG) plays a pivotal role in downstream +vision-language tasks. Existing SGG methods typically suffer from poor +compositional generalizations on unseen triplets. They are generally trained on +incompletely annotated scene graphs that contain dominant triplets and tend to +bias toward these seen triplets during inference. 
To address this issue, we +propose a Triplet Calibration and Reduction (T-CAR) framework in this paper. In +our framework, a triplet calibration loss is first presented to regularize the +representations of diverse triplets and to simultaneously excavate the unseen +triplets in incompletely annotated training scene graphs. Moreover, the unseen +space of scene graphs is usually several times larger than the seen space since +it contains a huge number of unrealistic compositions. Thus, we propose an +unseen space reduction loss to shift the attention of excavation to reasonable +unseen compositions to facilitate the model training. Finally, we propose a +contextual encoder to improve the compositional generalizations of unseen +triplets by explicitly modeling the relative spatial relations between subjects +and objects. Extensive experiments show that our approach achieves consistent +improvements for zero-shot SGG over state-of-the-art methods. The code is +available at https://github.com/jkli1998/T-CAR.",cs.CV,"['cs.CV', 'cs.MM']" +Open-vocabulary object 6D pose estimation,Jaime Corsetti · Davide Boscaini · Changjae Oh · Andrea Cavallaro · Fabio Poiesi, ,https://arxiv.org/abs/2312.00690v2,,2312.00690v2.pdf,Open-vocabulary object 6D pose estimation,"We introduce the new setting of open-vocabulary object 6D pose estimation, in +which a textual prompt is used to specify the object of interest. In contrast +to existing approaches, in our setting (i) the object of interest is specified +solely through the textual prompt, (ii) no object model (e.g. CAD or video +sequence) is required at inference, (iii) the object is imaged from two +different viewpoints of two different scenes, and (iv) the object was not +observed during the training phase. To operate in this setting, we introduce a +novel approach that leverages a Vision-Language Model to segment the object of +interest from two distinct scenes and to estimate its relative 6D pose. The key +of our approach is a carefully devised strategy to fuse object-level +information provided by the prompt with local image features, resulting in a +feature space that can generalize to novel concepts. We validate our approach +on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, +which collectively encompass 39 object instances appearing in four thousand +image pairs. The results demonstrate that our approach outperforms both a +well-established hand-crafted method and a recent deep learning-based baseline +in estimating the relative 6D pose of objects in different scenes. Project +page: https://jcorsetti.github.io/oryon/.",cs.CV,['cs.CV'] +Matching Anything by Segmenting Anything,Siyuan Li · Lei Ke · Martin Danelljan · Luigi Piccinelli · Mattia Segu · Luc Van Gool · Fisher Yu, ,https://arxiv.org/abs/2401.16741v1,,,MESA: Matching Everything by Segmenting Anything,"Feature matching is a crucial task in the field of computer vision, which +involves finding correspondences between images. Previous studies achieve +remarkable performance using learning-based feature comparison. However, the +pervasive presence of matching redundancy between images gives rise to +unnecessary and error-prone computations in these methods, imposing limitations +on their accuracy. To address this issue, we propose MESA, a novel approach to +establish precise area (or region) matches for efficient matching redundancy +reduction. 
MESA first leverages the advanced image understanding capability of +SAM, a state-of-the-art foundation model for image segmentation, to obtain +image areas with implicit semantic. Then, a multi-relational graph is proposed +to model the spatial structure of these areas and construct their scale +hierarchy. Based on graphical models derived from the graph, the area matching +is reformulated as an energy minimization task and effectively resolved. +Extensive experiments demonstrate that MESA yields substantial precision +improvement for multiple point matchers in indoor and outdoor downstream tasks, +e.g. +13.61% for DKM in indoor pose estimation.",cs.CV,['cs.CV'] +DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation,Xiaoliang Ju · Zhaoyang Huang · Yijin Li · Guofeng Zhang · Yu Qiao · Hongsheng Li, ,https://ar5iv.labs.arxiv.org/html/2311.17261,,,SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,"We propose SceneTex, a novel method for effectively generating high-quality +and style-consistent textures for indoor scenes using depth-to-image diffusion +priors. Unlike previous methods that either iteratively warp 2D views onto a +mesh surface or distillate diffusion latent features without accurate geometric +and style cues, SceneTex formulates the texture synthesis task as an +optimization problem in the RGB space where style and geometry consistency are +properly reflected. At its core, SceneTex proposes a multiresolution texture +field to implicitly encode the mesh appearance. We optimize the target texture +via a score-distillation-based objective function in respective RGB renderings. +To further secure the style consistency across views, we introduce a +cross-attention decoder to predict the RGB values by cross-attending to the +pre-sampled reference locations in each instance. SceneTex enables various and +accurate texture synthesis for 3D-FRONT scenes, demonstrating significant +improvements in visual quality and prompt fidelity over the prior texture +generation methods.",cs.CV,['cs.CV'] +DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection,Yuhao Sun · Lingyun Yu · Hongtao Xie · Jiaming Li · Yongdong Zhang, ,http://export.arxiv.org/abs/2405.09882,,2405.09882.pdf,DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection,"With the rapid development of face recognition (FR) systems, the privacy of +face images on social media is facing severe challenges due to the abuse of +unauthorized FR systems. Some studies utilize adversarial attack techniques to +defend against malicious FR systems by generating adversarial examples. +However, the generated adversarial examples, i.e., the protected face images, +tend to suffer from subpar visual quality and low transferability. In this +paper, we propose a novel face protection approach, dubbed DiffAM, which +leverages the powerful generative ability of diffusion models to generate +high-quality protected face images with adversarial makeup transferred from +reference images. To be specific, we first introduce a makeup removal module to +generate non-makeup images utilizing a fine-tuned diffusion model with guidance +of textual prompts in CLIP space. As the inverse process of makeup transfer, +makeup removal can make it easier to establish the deterministic relationship +between makeup domain and non-makeup domain regardless of elaborate text +prompts. 
Then, with this relationship, a CLIP-based makeup loss along with an +ensemble attack strategy is introduced to jointly guide the direction of +adversarial makeup domain, achieving the generation of protected face images +with natural-looking makeup and high black-box transferability. Extensive +experiments demonstrate that DiffAM achieves higher visual quality and attack +success rates with a gain of 12.98% under black-box setting compared with the +state of the arts. The code will be available at +https://github.com/HansSunY/DiffAM.",cs.CV,"['cs.CV', 'cs.AI']" +MoMask: Generative Masked Modeling of 3D Human Motions,chuan guo · Yuxuan Mu · Muhammad Gohar Javed · Sen Wang · Li Cheng, ,https://arxiv.org/abs/2312.00063,,2312.00063.pdf,MoMask: Generative Masked Modeling of 3D Human Motions,"We introduce MoMask, a novel masked modeling framework for text-driven 3D +human motion generation. In MoMask, a hierarchical quantization scheme is +employed to represent human motion as multi-layer discrete motion tokens with +high-fidelity details. Starting at the base layer, with a sequence of motion +tokens obtained by vector quantization, the residual tokens of increasing +orders are derived and stored at the subsequent layers of the hierarchy. This +is consequently followed by two distinct bidirectional transformers. For the +base-layer motion tokens, a Masked Transformer is designated to predict +randomly masked motion tokens conditioned on text input at training stage. +During generation (i.e. inference) stage, starting from an empty sequence, our +Masked Transformer iteratively fills up the missing tokens; Subsequently, a +Residual Transformer learns to progressively predict the next-layer tokens +based on the results from current layer. Extensive experiments demonstrate that +MoMask outperforms the state-of-art methods on the text-to-motion generation +task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, +and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly +applied in related tasks without further model fine-tuning, such as text-guided +temporal inpainting.",cs.CV,['cs.CV'] +Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,Zaid Khan · Vijay Kumar BG · Samuel Schulter · Yun Fu · Manmohan Chandraker, ,https://arxiv.org/abs/2404.04627,,2404.04627.pdf,Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,"Visual program synthesis is a promising approach to exploit the reasoning +abilities of large language models for compositional computer vision tasks. +Previous work has used few-shot prompting with frozen LLMs to synthesize visual +programs. Training an LLM to write better visual programs is an attractive +prospect, but it is unclear how to accomplish this. No dataset of visual +programs for training exists, and acquisition of a visual program dataset +cannot be easily crowdsourced due to the need for expert annotators. To get +around the lack of direct supervision, we explore improving the program +synthesis abilities of an LLM using feedback from interactive experience. We +propose a method where we exploit existing annotations for a vision-language +task to improvise a coarse reward signal for that task, treat the LLM as a +policy, and apply reinforced self-training to improve the visual program +synthesis ability of the LLM for that task. 
We describe a series of experiments +on object detection, compositional visual question answering, and image-text +retrieval, and show that in each case, the self-trained LLM outperforms or +performs on par with few-shot frozen LLMs that are an order of magnitude +larger. Website: https://zaidkhan.me/ViReP",cs.CV,['cs.CV'] +Scaling Laws of Synthetic Images for Model Training ... for Now,Lijie Fan · Kaifeng Chen · Dilip Krishnan · Dina Katabi · Phillip Isola · Yonglong Tian,https://github.com/google-research/syn-rep-learn/tree/main/Scaling,https://arxiv.org/abs/2312.04567,,2312.04567.pdf,Scaling Laws of Synthetic Images for Model Training ... for Now,"Recent significant advances in text-to-image models unlock the possibility of +training vision systems using synthetic images, potentially overcoming the +difficulty of collecting curated data at scale. It is unclear, however, how +these models behave at scale, as more synthetic data is added to the training +set. In this paper we study the scaling laws of synthetic images generated by +state of the art text-to-image models, for the training of supervised models: +image classifiers with label supervision, and CLIP with language supervision. +We identify several factors, including text prompts, classifier-free guidance +scale, and types of text-to-image models, that significantly affect scaling +behavior. After tuning these factors, we observe that synthetic images +demonstrate a scaling trend similar to, but slightly less effective than, real +images in CLIP training, while they significantly underperform in scaling when +training supervised image classifiers. Our analysis indicates that the main +reason for this underperformance is the inability of off-the-shelf +text-to-image models to generate certain concepts, a limitation that +significantly impairs the training of image classifiers. Our findings also +suggest that scaling synthetic data can be particularly effective in scenarios +such as: (1) when there is a limited supply of real images for a supervised +problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the +evaluation dataset diverges significantly from the training data, indicating +the out-of-distribution scenario, or (3) when synthetic data is used in +conjunction with real images, as demonstrated in the training of CLIP models.",cs.CV,['cs.CV'] +Adaptive Hyper-graph Aggregation for Modality-Agnostic Federated Learning,Fan Qi · Shuai Li, ,,https://ieeexplore.ieee.org/document/10528890,,,,,nan +DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling,Xiaoyun Zheng · Liwei Liao · Xufeng Li · Jianbo Jiao · Rongjie Wang · Feng Gao · Shiqi Wang · Ronggang Wang,https://pku-dymvhumans.github.io/,https://arxiv.org/abs/2403.16080,,2403.16080.pdf,PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling,"High-quality human reconstruction and photo-realistic rendering of a dynamic +scene is a long-standing problem in computer vision and graphics. Despite +considerable efforts invested in developing various capture systems and +reconstruction algorithms, recent advancements still struggle with loose or +oversized clothing and overly complex poses. In part, this is due to the +challenges of acquiring high-quality human datasets. To facilitate the +development of these fields, in this paper, we present PKU-DyMVHumans, a +versatile human-centric dataset for high-fidelity reconstruction and rendering +of dynamic human scenarios from dense multi-view videos. 
It comprises 8.2 +million frames captured by more than 56 synchronized cameras across diverse +scenarios. These sequences comprise 32 human subjects across 45 different +scenarios, each with a high-detailed appearance and realistic human motion. +Inspired by recent advancements in neural radiance field (NeRF)-based scene +representations, we carefully set up an off-the-shelf framework that is easy to +provide those state-of-the-art NeRF-based implementations and benchmark on +PKU-DyMVHumans dataset. It is paving the way for various applications like +fine-grained foreground/background decomposition, high-quality human +reconstruction and photo-realistic novel view synthesis of a dynamic scene. +Extensive studies are performed on the benchmark, demonstrating new +observations and challenges that emerge from using such high-fidelity dynamic +data.",cs.CV,['cs.CV'] +CrossMAE: Cross Modality Masked Autoencoders For Region-Aware Audio-Visual Pre-Training,Yuxin Guo · Siyang Sun · Shuailei Ma · Kecheng Zheng · Xiaoyi Bao · Shijie Ma · Wei Zou · Yun Zheng, ,https://arxiv.org/abs/2401.14391,,2401.14391.pdf,Rethinking Patch Dependence for Masked Autoencoders,"In this work, we re-examine inter-patch dependencies in the decoding +mechanism of masked autoencoders (MAE). We decompose this decoding mechanism +for masked patch reconstruction in MAE into self-attention and cross-attention. +Our investigations suggest that self-attention between mask patches is not +essential for learning good representations. To this end, we propose a novel +pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). +CrossMAE's decoder leverages only cross-attention between masked and visible +tokens, with no degradation in downstream performance. This design also enables +decoding only a small subset of mask tokens, boosting efficiency. Furthermore, +each decoder block can now leverage different encoder features, resulting in +improved representation learning. CrossMAE matches MAE in performance with 2.5 +to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet +classification and COCO instance segmentation under the same compute. Code and +models: https://crossmae.github.io",cs.CV,['cs.CV'] +Traceable Federated Continual Learning,Qiang Wang · Bingyan Liu · Yawen Li, ,https://arxiv.org/abs/2312.13500,,2312.13500.pdf,Federated Continual Novel Class Learning,"In a privacy-focused era, Federated Learning (FL) has emerged as a promising +machine learning technique. However, most existing FL studies assume that the +data distribution remains nearly fixed over time, while real-world scenarios +often involve dynamic and continual changes. To equip FL systems with continual +model evolution capabilities, we focus on an important problem called Federated +Continual Novel Class Learning (FedCN) in this work. The biggest challenge in +FedCN is to merge and align novel classes that are discovered and learned by +different clients without compromising privacy. To address this, we propose a +Global Alignment Learning (GAL) framework that can accurately estimate the +global novel class number and provide effective guidance for local training +from a global perspective, all while maintaining privacy protection. +Specifically, GAL first locates high-density regions in the representation +space through a bi-level clustering mechanism to estimate the novel class +number, with which the global prototypes corresponding to novel classes can be +constructed. 
Then, GAL uses a novel semantic weighted loss to capture all +possible correlations between these prototypes and the training data for +mitigating the impact of pseudo-label noise and data heterogeneity. Extensive +experiments on various datasets demonstrate GAL's superior performance over +state-of-the-art novel class discovery methods. In particular, GAL achieves +significant improvements in novel-class performance, increasing the accuracy by +5.1% to 10.6% in the case of one novel class learning stage and by 7.8% to +17.9% in the case of two novel class learning stages, without sacrificing +known-class performance. Moreover, GAL is shown to be effective in equipping a +variety of different mainstream FL algorithms with novel class discovery and +learning capability, highlighting its potential for many real-world +applications.",cs.CV,['cs.CV'] +PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting,Kenji Enomoto · TJ Rhodes · Brian Price · Gavin Miller, ,https://arxiv.org/abs/2311.13535,,2311.13535.pdf,DiffusionMat: Alpha Matting as Sequential Refinement Learning,"In this paper, we introduce DiffusionMat, a novel image matting framework +that employs a diffusion model for the transition from coarse to refined alpha +mattes. Diverging from conventional methods that utilize trimaps merely as +loose guidance for alpha matte prediction, our approach treats image matting as +a sequential refinement learning process. This process begins with the addition +of noise to trimaps and iteratively denoises them using a pre-trained diffusion +model, which incrementally guides the prediction towards a clean alpha matte. +The key innovation of our framework is a correction module that adjusts the +output at each denoising step, ensuring that the final result is consistent +with the input image's structures. We also introduce the Alpha Reliability +Propagation, a novel technique designed to maximize the utility of available +guidance by selectively enhancing the trimap regions with confident alpha +information, thus simplifying the correction task. To train the correction +module, we devise specialized loss functions that target the accuracy of the +alpha matte's edges and the consistency of its opaque and transparent regions. +We evaluate our model across several image matting benchmarks, and the results +indicate that DiffusionMat consistently outperforms existing methods. Project +page at~\url{https://cnnlstm.github.io/DiffusionMat",cs.CV,['cs.CV'] +Relightable and Animatable Neural Avatar from Sparse-View Video,Zhen Xu · Sida Peng · Chen Geng · Linzhan Mou · Zihan Yan · Jiaming Sun · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/relightable_avatar,https://arxiv.org/abs/2308.07903,,2308.07903.pdf,Relightable and Animatable Neural Avatar from Sparse-View Video,"This paper tackles the challenge of creating relightable and animatable +neural avatars from sparse-view (or even monocular) videos of dynamic humans +under unknown illumination. Compared to studio environments, this setting is +more practical and accessible but poses an extremely challenging ill-posed +problem. Previous neural human reconstruction methods are able to reconstruct +animatable avatars from sparse views using deformed Signed Distance Fields +(SDF) but cannot recover material parameters for relighting. 
While +differentiable inverse rendering-based methods have succeeded in material +recovery of static objects, it is not straightforward to extend them to dynamic +humans as it is computationally intensive to compute pixel-surface intersection +and light visibility on deformed SDFs for inverse rendering. To solve this +challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to +approximate the world space distances under arbitrary human poses. +Specifically, we estimate coarse distances based on a parametric human model +and compute fine distances by exploiting the local deformation invariance of +SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently +estimate the surface intersection and light visibility. This allows us to +develop the first system to recover animatable and relightable neural avatars +from sparse view (or monocular) inputs. Experiments demonstrate that our +approach is able to produce superior results compared to state-of-the-art +methods. Our code will be released for reproducibility.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +DeepCache: Accelerating Diffusion Models for Free,Xinyin Ma · Gongfan Fang · Xinchao Wang, ,https://arxiv.org/abs/2312.00858,,2312.00858.pdf,DeepCache: Accelerating Diffusion Models for Free,"Diffusion models have recently gained unprecedented attention in the field of +image synthesis due to their remarkable generative capabilities. +Notwithstanding their prowess, these models often incur substantial +computational costs, primarily attributed to the sequential denoising process +and cumbersome model size. Traditional methods for compressing diffusion models +typically involve extensive retraining, presenting cost and feasibility +challenges. In this paper, we introduce DeepCache, a novel training-free +paradigm that accelerates diffusion models from the perspective of model +architecture. DeepCache capitalizes on the inherent temporal redundancy +observed in the sequential denoising steps of diffusion models, which caches +and retrieves features across adjacent denoising stages, thereby curtailing +redundant computations. Utilizing the property of the U-Net, we reuse the +high-level features while updating the low-level features in a very cheap way. +This innovative strategy, in turn, enables a speedup factor of 2.3$\times$ for +Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1$\times$ +for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments +also demonstrate DeepCache's superiority over existing pruning and distillation +methods that necessitate retraining and its compatibility with current sampling +techniques. Furthermore, we find that under the same throughput, DeepCache +effectively achieves comparable or even marginally improved results with DDIM +or PLMS. The code is available at https://github.com/horseee/DeepCache",cs.CV,"['cs.CV', 'cs.AI']" +Unsupervised Occupancy Learning from Sparse Point Cloud,Amine Ouasfi · Adnane Boukhayma, ,https://arxiv.org/abs/2404.02759,,2404.02759.pdf,Unsupervised Occupancy Learning from Sparse Point Cloud,"Implicit Neural Representations have gained prominence as a powerful +framework for capturing complex data modalities, encompassing a wide range from +3D shapes to images and audio. Within the realm of 3D shape representation, +Neural Signed Distance Functions (SDF) have demonstrated remarkable potential +in faithfully encoding intricate shape geometry. 
However, learning SDFs from 3D +point clouds in the absence of ground truth supervision remains a very +challenging task. In this paper, we propose a method to infer occupancy fields +instead of SDFs as they are easier to learn from sparse inputs. We leverage a +margin-based uncertainty measure to differentially sample from the decision +boundary of the occupancy function and supervise the sampled boundary points +using the input point cloud. We further stabilize the optimization process at +the early stages of the training by biasing the occupancy function towards +minimal entropy fields while maximizing its entropy at the input point cloud. +Through extensive experiments and evaluations, we illustrate the efficacy of +our proposed method, highlighting its capacity to improve implicit shape +inference with respect to baselines and the state-of-the-art using synthetic +and real data.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +SGC-Occ: Semantic-Geometry Consistent 3D Occupancy Prediction for Autonomous Driving,Zhiwen Yang · Xiangteng He · Yuxin Peng, ,https://arxiv.org/abs/2403.08748,,2403.08748.pdf,Real-time 3D semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution,"In autonomous vehicles, understanding the surrounding 3D environment of the +ego vehicle in real-time is essential. A compact way to represent scenes while +encoding geometric distances and semantic object information is via 3D semantic +occupancy maps. State of the art 3D mapping methods leverage transformers with +cross-attention mechanisms to elevate 2D vision-centric camera features into +the 3D domain. However, these methods encounter significant challenges in +real-time applications due to their high computational demands during +inference. This limitation is particularly problematic in autonomous vehicles, +where GPU resources must be shared with other tasks such as localization and +planning. In this paper, we introduce an approach that extracts features from +front-view 2D camera images and LiDAR scans, then employs a sparse convolution +network (Minkowski Engine), for 3D semantic occupancy prediction. Given that +outdoor scenes in autonomous driving scenarios are inherently sparse, the +utilization of sparse convolution is particularly apt. By jointly solving the +problems of 3D scene completion of sparse scenes and 3D semantic segmentation, +we provide a more efficient learning framework suitable for real-time +applications in autonomous vehicles. We also demonstrate competitive accuracy +on the nuScenes dataset.",cs.RO,"['cs.RO', 'cs.CV']" +Countering Personalized Text-to-Image Generation with Influence Watermarks,Hanwen Liu · Zhicheng Sun · Yadong Mu, ,https://arxiv.org/abs/2312.15905,,,Cross Initialization for Personalized Text-to-Image Generation,"Recently, there has been a surge in face personalization techniques, +benefiting from the advanced capabilities of pretrained text-to-image diffusion +models. Among these, a notable method is Textual Inversion, which generates +personalized images by inverting given images into textual embeddings. However, +methods based on Textual Inversion still struggle with balancing the trade-off +between reconstruction quality and editability. In this study, we examine this +issue through the lens of initialization. Upon closely examining traditional +initialization methods, we identified a significant disparity between the +initial and learned embeddings in terms of both scale and orientation. 
The +scale of the learned embedding can be up to 100 times greater than that of the +initial embedding. Such a significant change in the embedding could increase +the risk of overfitting, thereby compromising the editability. Driven by this +observation, we introduce a novel initialization method, termed Cross +Initialization, that significantly narrows the gap between the initial and +learned embeddings. This method not only improves both reconstruction and +editability but also reduces the optimization steps from 5000 to 320. +Furthermore, we apply a regularization term to keep the learned embedding close +to the initial embedding. We show that when combined with Cross Initialization, +this regularization term can effectively improve editability. We provide +comprehensive empirical evidence to demonstrate the superior performance of our +method compared to the baseline methods. Notably, in our experiments, Cross +Initialization is the only method that successfully edits an individual's +facial expression. Additionally, a fast version of our method allows for +capturing an input image in roughly 26 seconds, while surpassing the baseline +methods in terms of both reconstruction and editability. Code will be made +publicly available.",cs.CV,['cs.CV'] +GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields,Yunsong Wang · Hanlin Chen · Gim Hee Lee, ,https://arxiv.org/abs/2404.00931,,2404.00931.pdf,GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields,"Recent advancements in vision-language foundation models have significantly +enhanced open-vocabulary 3D scene understanding. However, the generalizability +of existing methods is constrained due to their framework designs and their +reliance on 3D data. We address this limitation by introducing Generalizable +Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a +generalizable implicit representation of 3D scenes with open-vocabulary +semantics. We aggregate the geometry-aware features using a cost volume, and +propose a Multi-view Joint Fusion module to aggregate multi-view features +through a cross-view attention mechanism, which effectively predicts +view-specific blending weights for both colors and open-vocabulary features. +Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and +3D open-vocabulary semantic segmentation, eliminating the need for ground truth +semantic labels or depth priors, and effectively generalize across scenes and +datasets without fine-tuning.",cs.CV,['cs.CV'] +NeuRAD: Neural Rendering for Autonomous Driving,Adam Tonderski · Carl Lindström · Georg Hess · William Ljungbergh · Lennart Svensson · Christoffer Petersson,https://research.zenseact.com/publications/neurad/,https://arxiv.org/abs/2311.15260,,2311.15260.pdf,NeuRAD: Neural Rendering for Autonomous Driving,"Neural radiance fields (NeRFs) have gained popularity in the autonomous +driving (AD) community. Recent methods show NeRFs' potential for closed-loop +simulation, enabling testing of AD systems, and as an advanced training data +augmentation technique. However, existing methods often require long training +times, dense semantic supervision, or lack generalizability. This, in turn, +hinders the application of NeRFs for AD at scale. In this paper, we propose +NeuRAD, a robust novel view synthesis method tailored to dynamic AD data. 
Our +method features simple network design, extensive sensor modeling for both +camera and lidar -- including rolling shutter, beam divergence and ray dropping +-- and is applicable to multiple datasets out of the box. We verify its +performance on five popular AD datasets, achieving state-of-the-art performance +across the board. To encourage further development, we will openly release the +NeuRAD source code. See https://github.com/georghess/NeuRAD .",cs.CV,['cs.CV'] +Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning,Wei Zhang · Chaoqun Wan · Tongliang Liu · Xinmei Tian · Xu Shen · Jieping Ye, ,https://arxiv.org/abs/2404.00801,,2404.00801.pdf,$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding,"Video temporal grounding (VTG) is a fine-grained video understanding problem +that aims to ground relevant clips in untrimmed videos given natural language +queries. Most existing VTG models are built upon frame-wise final-layer CLIP +features, aided by additional temporal backbones (e.g., SlowFast) with +sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP +itself already shows great potential for fine-grained spatial-temporal +modeling, as each layer offers distinct yet useful information under different +granularity levels. Motivated by this, we propose Reversed Recurrent Tuning +($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework +for video temporal grounding. Our method learns a lightweight $R^2$ Block +containing only 1.5% of the total parameters to perform progressive +spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block +recurrently aggregates spatial features from earlier layers, then refines +temporal correlation conditioning on the given query, resulting in a +coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance +across three VTG tasks (i.e., moment retrieval, highlight detection, and video +summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, +Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional +backbone, demonstrating the significance and effectiveness of the proposed +scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.",cs.CV,['cs.CV'] +Enhancing Vision-Language Pretraining with Rich Supervisions,Yuan Gao · Kunyu Shi · Pengkai Zhu · Edouard Belval · Oren Nuriel · Srikar Appalaraju · Shabnam Ghadar · Zhuowen Tu · Vijay Mahadevan · Stefano Soatto, ,https://arxiv.org/abs/2403.03346,,,Enhancing Vision-Language Pre-training with Rich Supervisions,"We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel +pre-training paradigm for Vision-Language Models using data from large-scale +web screenshot rendering. Using web screenshots unlocks a treasure trove of +visual and textual cues that are not present in using image-text pairs. In S4, +we leverage the inherent tree-structured hierarchy of HTML elements and the +spatial localization to carefully design 10 pre-training tasks with large scale +annotated data. These tasks resemble downstream tasks across different domains +and the annotations are cheap to obtain. 
We demonstrate that, compared to +current screenshot pre-training objectives, our innovative pre-training method +significantly enhances performance of image-to-text model in nine varied and +popular downstream tasks - up to 76.1% improvements on Table Detection, and at +least 1% on Widget Captioning.",cs.CV,['cs.CV'] +A Category Agnostic Model for Visual Rearrangement,Yuyi Liu · Xinhang Song · Weijie Li · Xiaohan Wang · Shuqiang Jiang, ,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +Polos: Multimodal Metric Learning from Human Feedback for Image Captioning,Yuiga Wada · Kanta Kaneda · Daichi Saito · Komei Sugiura,https://yuiga.dev/polos,https://arxiv.org/abs/2402.18091,,2402.18091.pdf,Polos: Multimodal Metric Learning from Human Feedback for Image Captioning,"Establishing an automatic evaluation metric that closely aligns with human +judgments is essential for effectively developing image captioning models. +Recent data-driven metrics have demonstrated a stronger correlation with human +judgments than classic metrics such as CIDEr; however they lack sufficient +capabilities to handle hallucinations and generalize across diverse images and +texts partially because they compute scalar similarities merely using +embeddings learned from tasks unrelated to image captioning evaluation. In this +study, we propose Polos, a supervised automatic evaluation metric for image +captioning models. Polos computes scores from multimodal inputs, using a +parallel feature extraction mechanism that leverages embeddings trained through +large-scale contrastive learning. To train Polos, we introduce Multimodal +Metric Learning from Human Feedback (M$^2$LHF), a framework for developing +metrics based on human feedback. We constructed the Polaris dataset, which +comprises 131K human judgments from 550 evaluators, which is approximately ten +times larger than standard datasets. Our approach achieved state-of-the-art +performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and +the Polaris dataset, thereby demonstrating its effectiveness and robustness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CLIB-FIQA: Face Image Quality Assessment with Confidence Calibration,Fu-Zhao Ou · Fu-Zhao Ou · Chongyi Li · Shiqi Wang · Sam Kwong, ,https://arxiv.org/abs/2404.12203,,2404.12203.pdf,GraFIQs: Face Image Quality Assessment Using Gradient Magnitudes,"Face Image Quality Assessment (FIQA) estimates the utility of face images for +automated face recognition (FR) systems. We propose in this work a novel +approach to assess the quality of face images based on inspecting the required +changes in the pre-trained FR model weights to minimize differences between +testing samples and the distribution of the FR training dataset. To achieve +that, we propose quantifying the discrepancy in Batch Normalization statistics +(BNS), including mean and variance, between those recorded during FR training +and those obtained by processing testing samples through the pretrained FR +model. We then generate gradient magnitudes of pretrained FR weights by +backpropagating the BNS through the pretrained model. The cumulative absolute +sum of these gradient magnitudes serves as the FIQ for our approach. 
Through +comprehensive experimentation, we demonstrate the effectiveness of our +training-free and quality labeling-free approach, achieving competitive +performance to recent state-of-theart FIQA approaches without relying on +quality labeling, the need to train regression networks, specialized +architectures, or designing and optimizing specific loss functions.",cs.CV,['cs.CV'] +EVCap: Retrieval-Augmented Image Captioning with External Visual--Name Memory for Open-World Comprehension,Jiaxuan Li · Duc Minh Vo · Akihiro Sugimoto · Hideki Nakayama, ,https://arxiv.org/abs/2311.15879v2,,2311.15879v2.pdf,EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension,"Large language models (LLMs)-based image captioning has the capability of +describing objects not explicitly observed in training data; yet novel objects +occur frequently, necessitating the requirement of sustaining up-to-date object +knowledge for open-world comprehension. Instead of relying on large amounts of +data and/or scaling up network parameters, we introduce a highly effective +retrieval-augmented image captioning method that prompts LLMs with object names +retrieved from External Visual--name memory (EVCap). We build ever-changing +object knowledge memory using objects' visuals and names, enabling us to (i) +update the memory at a minimal cost and (ii) effortlessly augment LLMs with +retrieved object names by utilizing a lightweight and fast-to-train model. Our +model, which was trained only on the COCO dataset, can adapt to out-of-domain +without requiring additional fine-tuning or re-training. Our experiments +conducted on benchmarks and synthetic commonsense-violating data show that +EVCap, with only 3.97M trainable parameters, exhibits superior performance +compared to other methods based on frozen pre-trained LLMs. Its performance is +also competitive to specialist SOTAs that require extensive training.",cs.CV,['cs.CV'] +On Exact Inversion of DPM-Solvers,Seongmin Hong · Kyeonghyun Lee · Suh Yoon Jeon · Hyewon Bae · Se Young Chun,https://smhongok.github.io/inv-dpm.html,https://arxiv.org/abs/2311.18387v1,,2311.18387v1.pdf,On Exact Inversion of DPM-Solvers,"Diffusion probabilistic models (DPMs) are a key component in modern +generative models. DPM-solvers have achieved reduced latency and enhanced +quality significantly, but have posed challenges to find the exact inverse +(i.e., finding the initial noise from the given image). Here we investigate the +exact inversions for DPM-solvers and propose algorithms to perform them when +samples are generated by the first-order as well as higher-order DPM-solvers. +For each explicit denoising step in DPM-solvers, we formulated the inversions +using implicit methods such as gradient descent or forward step method to +ensure the robustness to large classifier-free guidance unlike the prior +approach using fixed-point iteration. Experimental results demonstrated that +our proposed exact inversion methods significantly reduced the error of both +image and noise reconstructions, greatly enhanced the ability to distinguish +invisible watermarks and well prevented unintended background changes +consistently during image editing. 
Project page: +\url{https://smhongok.github.io/inv-dpm.html}.",cs.CV,"['cs.CV', 'cs.LG']" +Learning Structure-from-Motion with Graph Attention Networks,Lucas Brynte · José Pedro Iglesias · Carl Olsson · Fredrik Kahl,https://github.com/lucasbrynte/gasfm/,https://arxiv.org/abs/2308.15984,,2308.15984.pdf,Learning Structure-from-Motion with Graph Attention Networks,"In this paper we tackle the problem of learning Structure-from-Motion (SfM) +through the use of graph attention networks. SfM is a classic computer vision +problem that is solved though iterative minimization of reprojection errors, +referred to as Bundle Adjustment (BA), starting from a good initialization. In +order to obtain a good enough initialization to BA, conventional methods rely +on a sequence of sub-problems (such as pairwise pose estimation, pose averaging +or triangulation) which provide an initial solution that can then be refined +using BA. In this work we replace these sub-problems by learning a model that +takes as input the 2D keypoints detected across multiple views, and outputs the +corresponding camera poses and 3D keypoint coordinates. Our model takes +advantage of graph neural networks to learn SfM-specific primitives, and we +show that it can be used for fast inference of the reconstruction for new and +unseen sequences. The experimental results show that the proposed model +outperforms competing learning-based methods, and challenges COLMAP while +having lower runtime. Our code is available at +https://github.com/lucasbrynte/gasfm/.",cs.CV,"['cs.CV', 'cs.LG']" +Plug and Play Active Learning for Object Detection,Chenhongyi Yang · Lichao Huang · Elliot Crowley, ,,https://allainews.com/item/plug-and-play-active-learning-for-object-detection-2024-03-15/,,,,,nan +MACE: Mass Concept Erasure in Diffusion Models,Shilin Lu · Zilan Wang · Leyang Li · Yanzhu Liu · Adams Wai-Kin Kong,https://github.com/Shilin-LU/MACE,https://arxiv.org/abs/2403.06135,,2403.06135.pdf,MACE: Mass Concept Erasure in Diffusion Models,"The rapid expansion of large-scale text-to-image diffusion models has raised +growing concerns regarding their potential misuse in creating harmful or +misleading content. In this paper, we introduce MACE, a finetuning framework +for the task of mass concept erasure. This task aims to prevent models from +generating images that embody unwanted concepts when prompted. Existing concept +erasure methods are typically restricted to handling fewer than five concepts +simultaneously and struggle to find a balance between erasing concept synonyms +(generality) and maintaining unrelated concepts (specificity). In contrast, +MACE differs by successfully scaling the erasure scope up to 100 concepts and +by achieving an effective balance between generality and specificity. This is +achieved by leveraging closed-form cross-attention refinement along with LoRA +finetuning, collectively eliminating the information of undesirable concepts. +Furthermore, MACE integrates multiple LoRAs without mutual interference. We +conduct extensive evaluations of MACE against prior methods across four +different tasks: object erasure, celebrity erasure, explicit content erasure, +and artistic style erasure. Our results reveal that MACE surpasses prior +methods in all evaluated tasks. 
Code is available at +https://github.com/Shilin-LU/MACE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Contextual Augmented Global Contrast for Multimodal Intent Recognition,Kaili Sun · Zhiwen Xie · Mang Ye · Huyin Zhang, ,https://arxiv.org/html/2312.14667v1,,2312.14667v1.pdf,Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition,"Multimodal intent recognition aims to leverage diverse modalities such as +expressions, body movements and tone of speech to comprehend user's intent, +constituting a critical task for understanding human language and behavior in +real-world multimodal scenarios. Nevertheless, the majority of existing methods +ignore potential correlations among different modalities and own limitations in +effectively learning semantic features from nonverbal modalities. In this +paper, we introduce a token-level contrastive learning method with +modality-aware prompting (TCL-MAP) to address the above challenges. To +establish an optimal multimodal semantic environment for text modality, we +develop a modality-aware prompting module (MAP), which effectively aligns and +fuses features from text, video and audio modalities with similarity-based +modality alignment and cross-modality attention mechanism. Based on the +modality-aware prompt and ground truth labels, the proposed token-level +contrastive learning framework (TCL) constructs augmented samples and employs +NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal +textual semantic insights derived from intent labels to guide the learning +processes of other modalities in return. Extensive experiments show that our +method achieves remarkable improvements compared to state-of-the-art methods. +Additionally, ablation analyses demonstrate the superiority of the +modality-aware prompt over the handcrafted prompt, which holds substantial +significance for multimodal prompt learning. The codes are released at +https://github.com/thuiar/TCL-MAP.",cs.MM,"['cs.MM', 'cs.LG']" +Fixed Point Diffusion Models,Luke Melas-Kyriazi · Xingjian Bai, ,https://arxiv.org/abs/2401.08741,,2401.08741.pdf,Fixed Point Diffusion Models,"We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to +image generation that integrates the concept of fixed point solving into the +framework of diffusion-based generative modeling. Our approach embeds an +implicit fixed point solving layer into the denoising network of a diffusion +model, transforming the diffusion process into a sequence of closely-related +fixed point problems. Combined with a new stochastic training method, this +approach significantly reduces model size, reduces memory usage, and +accelerates training. Moreover, it enables the development of two new +techniques to improve sampling efficiency: reallocating computation across +timesteps and reusing fixed point solutions between timesteps. We conduct +extensive experiments with state-of-the-art models on ImageNet, FFHQ, +CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in +performance and efficiency. Compared to the state-of-the-art DiT model, FPDM +contains 87% fewer parameters, consumes 60% less memory during training, and +improves image generation quality in situations where sampling computation or +time is limited. 
Our code and pretrained models are available at +https://lukemelas.github.io/fixed-point-diffusion-models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +High Fidelity Person-centric Subject-to-Image Synthesis,Yibin Wang · Weizhong Zhang · Jianwei Zheng · Cheng Jin, ,https://arxiv.org/abs/2311.10329,,2311.10329.pdf,High-fidelity Person-centric Subject-to-Image Synthesis,"Current subject-driven image generation methods encounter significant +challenges in person-centric image generation. The reason is that they learn +the semantic scene and person generation by fine-tuning a common pre-trained +diffusion, which involves an irreconcilable training imbalance. Precisely, to +generate realistic persons, they need to sufficiently tune the pre-trained +model, which inevitably causes the model to forget the rich semantic scene +prior and makes scene generation over-fit to the training data. Moreover, even +with sufficient fine-tuning, these methods can still not generate high-fidelity +persons since joint learning of the scene and person generation also lead to +quality compromise. In this paper, we propose Face-diffuser, an effective +collaborative generation pipeline to eliminate the above training imbalance and +quality compromise. Specifically, we first develop two specialized pre-trained +diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented +Diffusion Model (SDM), for scene and person generation, respectively. The +sampling process is divided into three sequential stages, i.e., semantic scene +construction, subject-scene fusion, and subject enhancement. The first and last +stages are performed by TDM and SDM respectively. The subject-scene fusion +stage, that is the collaboration achieved through a novel and highly effective +mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on +our key observation that there exists a robust link between classifier-free +guidance responses and the saliency of generated images. In each time step, SNF +leverages the unique strengths of each model and allows for the spatial +blending of predicted noises from both models automatically in a saliency-aware +manner. Extensive experiments confirm the impressive effectiveness and +robustness of the Face-diffuser.",cs.CV,"['cs.CV', 'cs.AI']" +On the Content Bias in Fréchet Video Distance,Songwei Ge · Aniruddha Mahapatra · Gaurav Parmar · Jun-Yan Zhu · Jia-Bin Huang, ,https://arxiv.org/abs/2404.12391,,2404.12391.pdf,On the Content Bias in Fréchet Video Distance,"Fr\'echet Video Distance (FVD), a prominent metric for evaluating video +generation models, is known to conflict with human perception occasionally. In +this paper, we aim to explore the extent of FVD's bias toward per-frame quality +over temporal realism and identify its sources. We first quantify the FVD's +sensitivity to the temporal axis by decoupling the frame and motion quality and +find that the FVD increases only slightly with large temporal corruption. We +then analyze the generated videos and show that via careful sampling from a +large set of generated videos that do not contain motions, one can drastically +decrease FVD without improving the temporal quality. Both studies suggest FVD's +bias towards the quality of individual frames. We further observe that the bias +can be attributed to the features extracted from a supervised video classifier +trained on the content-biased dataset. 
We show that FVD with features extracted +from the recent large-scale self-supervised video models is less biased toward +image quality. Finally, we revisit a few real-world examples to validate our +hypothesis.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification,kaijie ren · Lei Zhang, ,https://arxiv.org/abs/2403.11708v2,,2403.11708v2.pdf,Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification,"Visible-Infrared Person Re-identification (VI-ReID) is a challenging +cross-modal pedestrian retrieval task, due to significant intra-class +variations and cross-modal discrepancies among different cameras. Existing +works mainly focus on embedding images of different modalities into a unified +space to mine modality-shared features. They only seek distinctive information +within these shared features, while ignoring the identity-aware useful +information that is implicit in the modality-specific features. To address this +issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) +network to uncover and leverage the implicit discriminative information +contained within the modality-specific. First, we extract modality-specific and +modality-shared features using a novel dual-stream network. Then, the +modality-specific features undergo purification to reduce their modality style +discrepancies while preserving identity-aware discriminative knowledge. +Subsequently, this kind of implicit knowledge is distilled into the +modality-shared feature to enhance its distinctiveness. Finally, an alignment +loss is proposed to minimize modality discrepancy on enhanced modality-shared +features. Extensive experiments on multiple public datasets demonstrate the +superiority of IDKL network over the state-of-the-art methods. Code is +available at https://github.com/1KK077/IDKL.",cs.CV,['cs.CV'] +PointBeV: A Sparse Approach for BeV Predictions,Loick Chambon · Éloi Zablocki · Mickaël Chen · Florent Bartoccioni · Patrick Pérez · Matthieu Cord, ,https://arxiv.org/abs/2312.00703,,2312.00703.pdf,PointBeV: A Sparse Approach to BeV Predictions,"Bird's-eye View (BeV) representations have emerged as the de-facto shared +space in driving applications, offering a unified space for sensor data fusion +and supporting various downstream tasks. However, conventional models use grids +with fixed resolution and range and face computational inefficiencies due to +the uniform allocation of resources across all cells. To address this, we +propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV +cells instead of dense grids. This approach offers precise control over memory +usage, enabling the use of long temporal contexts and accommodating +memory-constrained platforms. PointBeV employs an efficient two-pass strategy +for training, enabling focused computation on regions of interest. At inference +time, it can be used with various memory/performance trade-offs and flexibly +adjusts to new specific use cases. PointBeV achieves state-of-the-art results +on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, +showcasing superior performance in static and temporal settings despite being +trained solely with sparse signals. We will release our code along with two new +efficient modules used in the architecture: Sparse Feature Pulling, designed +for the effective extraction of features from images to BeV, and Submanifold +Attention, which enables efficient temporal modeling. 
Our code is available at +https://github.com/valeoai/PointBeV.",cs.CV,['cs.CV'] +Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion,Su Sun · Cheng Zhao · Yuliang Guo · Ruoyu Wang · Xinyu Huang · Yingjie Victor Chen · Liu Ren, ,https://arxiv.org/abs/2404.03070,,2404.03070.pdf,Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion,"In this paper, we present a novel indoor 3D reconstruction method with +occluded surface completion, given a sequence of depth readings. Prior +state-of-the-art (SOTA) methods only focus on the reconstruction of the visible +areas in a scene, neglecting the invisible areas due to the occlusions, e.g., +the contact surface between furniture, occluded wall and floor. Our method +tackles the task of completing the occluded scene surfaces, resulting in a +complete 3D scene mesh. The core idea of our method is learning 3D geometry +prior from various complete scenes to infer the occluded geometry of an unseen +scene from solely depth measurements. We design a coarse-fine hierarchical +octree representation coupled with a dual-decoder architecture, i.e., +Geo-decoder and 3D Inpainter, which jointly reconstructs the complete 3D scene +geometry. The Geo-decoder with detailed representation at fine levels is +optimized online for each scene to reconstruct visible surfaces. The 3D +Inpainter with abstract representation at coarse levels is trained offline +using various scenes to complete occluded surfaces. As a result, while the +Geo-decoder is specialized for an individual scene, the 3D Inpainter can be +generally applied across different scenes. We evaluate the proposed method on +the 3D Completed Room Scene (3D-CRS) and iTHOR datasets, significantly +outperforming the SOTA methods by a gain of 16.8% and 24.2% in terms of the +completeness of 3D reconstruction. 3D-CRS dataset including a complete 3D mesh +of each scene is provided at project webpage.",cs.CV,['cs.CV'] +VidLA: Video-Language Alignment at Scale,Mamshad Nayeem Rizve · Fan Fei · Jayakrishnan Unnikrishnan · Son Dinh Tran · Benjamin Yao · Belinda Zeng · Mubarak Shah · Trishul Chilimbi, ,https://arxiv.org/abs/2403.14870,,2403.14870.pdf,VidLA: Video-Language Alignment at Scale,"In this paper, we propose VidLA, an approach for video-language alignment at +scale. There are two major limitations of previous video-language alignment +approaches. First, they do not capture both short-range and long-range temporal +dependencies and typically employ complex hierarchical deep network +architectures that are hard to integrate with existing pretrained image-text +foundation models. To effectively address this limitation, we instead keep the +network architecture simple and use a set of data tokens that operate at +different temporal resolutions in a hierarchical manner, accounting for the +temporally hierarchical nature of videos. By employing a simple two-tower +architecture, we are able to initialize our video-language model with +pretrained image-text foundation models, thereby boosting the final +performance. Second, existing video-language alignment works struggle due to +the lack of semantically aligned large-scale training data. To overcome it, we +leverage recent LLMs to curate the largest video-language dataset to date with +better visual grounding. 
Furthermore, unlike existing video-text datasets which +only contain short clips, our dataset is enriched with video clips of varying +durations to aid our temporally hierarchical data tokens in extracting better +representations at varying temporal scales. Overall, empirical results show +that our proposed approach surpasses state-of-the-art methods on multiple +retrieval benchmarks, especially on longer videos, and performs competitively +on classification benchmarks.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +ODIN: A Single Model for 2D and 3D Segmentation,Ayush Jain · Pushkal Katara · Nikolaos Gkanatsios · Adam Harley · Gabriel Sarch · Kriti Aggarwal · Vishrav Chaudhary · Katerina Fragkiadaki, ,https://arxiv.org/abs/2401.02416,,2401.02416.pdf,ODIN: A Single Model for 2D and 3D Segmentation,"State-of-the-art models on contemporary 3D segmentation benchmarks like +ScanNet consume and label dataset-provided 3D point clouds, obtained through +post processing of sensed multiview RGB-D images. They are typically trained +in-domain, forego large-scale 2D pre-training and outperform alternatives that +featurize the posed RGB-D multiview images instead. The gap in performance +between methods that consume posed images versus post-processed 3D point clouds +has fueled the belief that 2D and 3D perception require distinct model +architectures. In this paper, we challenge this view and propose ODIN +(Omni-Dimensional INstance segmentation), a model that can segment and label +both 2D RGB images and 3D point clouds, using a transformer architecture that +alternates between 2D within-view and 3D cross-view information fusion. Our +model differentiates 2D and 3D feature operations through the positional +encodings of the tokens involved, which capture pixel coordinates for 2D patch +tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art +performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation +benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It +outperforms all previous works by a wide margin when the sensed 3D point cloud +is used in place of the point cloud sampled from 3D mesh. When used as the 3D +perception engine in an instructable embodied agent architecture, it sets a new +state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and +checkpoints can be found at the project website (https://odin-seg.github.io).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding,Syed Talal Wasim · Muzammal Naseer · Salman Khan · Ming-Hsuan Yang · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2401.00901,,2401.00901.pdf,Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding,"Video grounding aims to localize a spatio-temporal section in a video +corresponding to an input text query. This paper addresses a critical +limitation in current video grounding methodologies by introducing an +Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent +closed-set approaches that struggle with open-vocabulary scenarios due to +limited training data and predefined vocabularies, our model leverages +pre-trained representations from foundational spatial grounding models. This +empowers it to effectively bridge the semantic gap between natural language and +diverse visual content, achieving strong performance in closed-set and +open-vocabulary settings. 
Our contributions include a novel spatio-temporal +video grounding model, surpassing state-of-the-art results in closed-set +evaluations on multiple datasets and demonstrating superior performance in +open-vocabulary scenarios. Notably, the proposed model outperforms +state-of-the-art methods in closed-set settings on VidSTG (Declarative and +Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in +open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model +surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ +accuracy, demonstrating its efficacy in handling diverse linguistic and visual +concepts for improved video understanding. Our codes will be publicly released.",cs.CV,['cs.CV'] +Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts,Jiawen Zhu · Guansong Pang, ,https://arxiv.org/abs/2403.06495,,2403.06495.pdf,Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts,"This paper explores the problem of Generalist Anomaly Detection (GAD), aiming +to train one single detection model that can generalize to detect anomalies in +diverse datasets from different application domains without any further +training on the target data. Some recent studies have shown that large +pre-trained Visual-Language Models (VLMs) like CLIP have strong generalization +capabilities on detecting industrial defects from various datasets, but their +methods rely heavily on handcrafted text prompts about defects, making them +difficult to generalize to anomalies in other applications, e.g., medical image +anomalies or semantic anomalies in natural images. In this work, we propose to +train a GAD model with few-shot normal images as sample prompts for AD on +diverse datasets on the fly. To this end, we introduce a novel approach that +learns an in-context residual learning model for GAD, termed InCTRL. It is +trained on an auxiliary dataset to discriminate anomalies from normal samples +based on a holistic evaluation of the residuals between query images and +few-shot normal sample prompts. Regardless of the datasets, per definition of +anomaly, larger residuals are expected for anomalies than normal samples, +thereby enabling InCTRL to generalize across different domains without further +training. Comprehensive experiments on nine AD datasets are performed to +establish a GAD benchmark that encapsulate the detection of industrial defect +anomalies, medical anomalies, and semantic anomalies in both one-vs-all and +multi-class setting, on which InCTRL is the best performer and significantly +outperforms state-of-the-art competing methods. Code is available at +https://github.com/mala-lab/InCTRL.",cs.CV,['cs.CV'] +ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting,Chen Duan · Pei Fu · Shan Guo · Qianyi Jiang · Xiaoming Wei, ,https://arxiv.org/abs/2403.00303,,2403.00303.pdf,ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting,"In recent years, text-image joint pre-training techniques have shown +promising results in various tasks. However, in Optical Character Recognition +(OCR) tasks, aligning text instances with their corresponding text regions in +images poses a challenge, as it requires effective alignment between text and +OCR-Text (referring to the text in images as OCR-Text to distinguish from the +text in natural language) rather than a holistic understanding of the overall +image content. 
In this paper, we propose a new pre-training method called +OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text +found in images to a uniform style based on the text prompt. With ODM, we +achieve better alignment between text and OCR-Text and enable pre-trained +models to adapt to the complex and diverse styles of scene text detection and +spotting tasks. Additionally, we have designed a new labeling generation method +specifically for ODM and combined it with our proposed Text-Controller module +to address the challenge of annotation costs in OCR tasks, allowing a larger +amount of unlabeled data to participate in pre-training. Extensive experiments +on multiple public datasets demonstrate that our method significantly improves +performance and outperforms current pre-training methods in scene text +detection and spotting tasks. Code is available at +https://github.com/PriNing/ODM.",cs.CV,['cs.CV'] +"LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning",Sijin Chen · Xin Chen · Chi Zhang · Mingsheng Li · Gang Yu · Hao Fei · Hongyuan Zhu · Jiayuan Fan · Tao Chen, ,https://arxiv.org/abs/2311.18651,,2311.18651.pdf,"LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning","Recent advances in Large Multimodal Models (LMM) have made it possible for +various applications in human-machine interactions. However, developing LMMs +that can comprehend, reason, and plan in complex and diverse 3D environments +remains a challenging topic, especially considering the demand for +understanding permutation-invariant point cloud 3D representations of the 3D +scene. Existing works seek help from multi-view images, and project 2D features +to 3D space as 3D scene representations. This, however, leads to huge +computational overhead and performance degradation. In this paper, we present +LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and +respond to both textual-instructions and visual-prompts. This help LMMs better +comprehend human interactions and further help to remove the ambiguities in +cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, +and surpasses various 3D vision-language models on both 3D Dense Captioning and +3D Question Answering.",cs.CV,['cs.CV'] +UV-IDM: Identity-Conditioned Latent Diffusion Model for Face UV-Texture Generation,Hong Li · Yutang Feng · Song Xue · Xuhui Liu · Boyu Liu · Bohan Zeng · Shanglin Li · Jianzhuang Liu · Shumin Han · Baochang Zhang, ,https://arxiv.org/abs/2403.19235,,,DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,"While large-scale pre-trained text-to-image models can synthesize diverse and +high-quality human-centered images, novel challenges arise with a nuanced task +of ""identity fine editing"": precisely modifying specific features of a subject +while maintaining its inherent identity and context. Existing personalization +methods either require time-consuming optimization or learning additional +encoders, adept in ""identity re-contextualization"". However, they often +struggle with detailed and sensitive tasks like human face editing. To address +these challenges, we introduce DreamSalon, a noise-guided, staged-editing +framework, uniquely focusing on detailed image manipulations and +identity-context preservation. 
By discerning editing and boosting stages via +the frequency and gradient of predicted noises, DreamSalon first performs +detailed manipulations on specific features in the editing stage, guided by +high-frequency information, and then employs stochastic denoising in the +boosting stage to improve image quality. For more precise editing, DreamSalon +semantically mixes source and target textual prompts, guided by differences in +their embedding covariances, to direct the model's focus on specific +manipulation areas. Our experiments demonstrate DreamSalon's ability to +efficiently and faithfully edit fine details on human faces, outperforming +existing methods both qualitatively and quantitatively.",cs.CV,['cs.CV'] +Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation,Guangyang Wu · Xiaohong Liu · Jun Jia · Xuehao Cui · Guangtao Zhai, ,https://arxiv.org/abs/2403.06452,,2403.06452.pdf,Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation,"In the digital era, QR codes serve as a linchpin connecting virtual and +physical realms. Their pervasive integration across various applications +highlights the demand for aesthetically pleasing codes without compromised +scannability. However, prevailing methods grapple with the intrinsic challenge +of balancing customization and scannability. Notably, stable-diffusion models +have ushered in an epoch of high-quality, customizable content generation. This +paper introduces Text2QR, a pioneering approach leveraging these advancements +to address a fundamental challenge: concurrently achieving user-defined +aesthetics and scanning robustness. To ensure stable generation of aesthetic QR +codes, we introduce the QR Aesthetic Blueprint (QAB) module, generating a +blueprint image exerting control over the entire generation process. +Subsequently, the Scannability Enhancing Latent Refinement (SELR) process +refines the output iteratively in the latent space, enhancing scanning +robustness. This approach harnesses the potent generation capabilities of +stable-diffusion models, navigating the trade-off between image aesthetics and +QR code scannability. Our experiments demonstrate the seamless fusion of visual +appeal with the practical utility of aesthetic QR codes, markedly outperforming +prior methods. Codes are available at \url{https://github.com/mulns/Text2QR}",cs.CV,['cs.CV'] +Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers,Zi-Xin Zou · Zhipeng Yu · Yuan-Chen Guo · Yangguang Li · Yan-Pei Cao · Ding Liang · Song-Hai Zhang,https://zouzx.github.io/TriplaneGaussian/,https://arxiv.org/abs/2312.09147,,2312.09147.pdf,Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers,"Recent advancements in 3D reconstruction from single images have been driven +by the evolution of generative models. Prominent among these are methods based +on Score Distillation Sampling (SDS) and the adaptation of diffusion models in +the 3D domain. Despite their progress, these techniques often face limitations +due to slow optimization or rendering processes, leading to extensive training +and optimization times. In this paper, we introduce a novel approach for +single-view reconstruction that efficiently generates a 3D model from a single +image via feed-forward inference. 
Our method utilizes two transformer-based +networks, namely a point decoder and a triplane decoder, to reconstruct 3D +objects using a hybrid Triplane-Gaussian intermediate representation. This +hybrid representation strikes a balance, achieving a faster rendering speed +compared to implicit representations while simultaneously delivering superior +rendering quality than explicit representations. The point decoder is designed +for generating point clouds from single images, offering an explicit +representation which is then utilized by the triplane decoder to query Gaussian +features for each point. This design choice addresses the challenges associated +with directly regressing explicit 3D Gaussian attributes characterized by their +non-structural nature. Subsequently, the 3D Gaussians are decoded by an MLP to +enable rapid rendering through splatting. Both decoders are built upon a +scalable, transformer-based architecture and have been efficiently trained on +large-scale 3D datasets. The evaluations conducted on both synthetic datasets +and real-world images demonstrate that our method not only achieves higher +quality but also ensures a faster runtime in comparison to previous +state-of-the-art techniques. Please see our project page at +https://zouzx.github.io/TriplaneGaussian/.",cs.CV,['cs.CV'] +Active Object Detection with Knowledge Aggregation and Distillation from Large Models,Dejie Yang · Yang Liu, ,https://arxiv.org/abs/2405.12509,,2405.12509.pdf,Active Object Detection with Knowledge Aggregation and Distillation from Large Models,"Accurately detecting active objects undergoing state changes is essential for +comprehending human interactions and facilitating decision-making. The existing +methods for active object detection (AOD) primarily rely on visual appearance +of the objects within input, such as changes in size, shape and relationship +with hands. However, these visual changes can be subtle, posing challenges, +particularly in scenarios with multiple distracting no-change instances of the +same category. We observe that the state changes are often the result of an +interaction being performed upon the object, thus propose to use informed +priors about object related plausible interactions (including semantics and +visual appearance) to provide more reliable cues for AOD. Specifically, we +propose a knowledge aggregation procedure to integrate the aforementioned +informed priors into oracle queries within the teacher decoder, offering more +object affordance commonsense to locate the active object. To streamline the +inference process and reduce extra knowledge inputs, we propose a knowledge +distillation approach that encourages the student decoder to mimic the +detection capabilities of the teacher decoder using the oracle query by +replicating its predictions and attention. Our proposed framework achieves +state-of-the-art performance on four datasets, namely Ego4D, Epic-Kitchens, +MECCANO, and 100DOH, which demonstrates the effectiveness of our approach in +improving AOD.",cs.CV,['cs.CV'] +Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,shiyu xuan · Qingpei Guo · Ming Yang · Shiliang Zhang, ,https://arxiv.org/abs/2310.00582,,,Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities +in various multi-modal tasks. Nevertheless, their performance in fine-grained +image understanding tasks is still limited. 
To address this issue, this paper +proposes a new framework to enhance the fine-grained image understanding +abilities of MLLMs. Specifically, we present a new method for constructing the +instruction tuning dataset at a low cost by leveraging annotations in existing +datasets. A self-consistent bootstrapping method is also introduced to extend +existing dense object annotations into high-quality +referring-expression-bounding-box pairs. These methods enable the generation of +high-quality instruction data which includes a wide range of fundamental +abilities essential for fine-grained image perception. Moreover, we argue that +the visual encoder should be tuned during instruction tuning to mitigate the +gap between full image perception and fine-grained image perception. +Experimental results demonstrate the superior performance of our method. For +instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA +and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also +attained the top rank on the leaderboard of MMBench. This promising performance +is achieved by training on only publicly available data, making it easily +reproducible. The models, datasets, and codes are publicly available at +https://github.com/SY-Xuan/Pink.",cs.CV,"['cs.CV', 'cs.AI']" +PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization,Yanlu Cai · Weizhong Zhang · Yuan Wu · Cheng Jin, ,https://arxiv.org/abs/2405.05216,,,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +Transcriptomics-guided Slide Representation Learning in Computational Pathology,Guillaume Jaume · Lukas Oldenburg · Anurag Vaidya · Richard J. Chen · Drew F. K. 
Williamson · Thomas Peeters · Andrew Song · Faisal Mahmood,https://github.com/mahmoodlab/TANGLE,https://arxiv.org/abs/2405.11618,,2405.11618.pdf,Transcriptomics-guided Slide Representation Learning in Computational Pathology,"Self-supervised learning (SSL) has been successful in building patch +embeddings of small histology images (e.g., 224x224 pixels), but scaling these +models to learn slide embeddings from the entirety of giga-pixel whole-slide +images (WSIs) remains challenging. Here, we leverage complementary information +from gene expression profiles to guide slide representation learning using +multimodal pre-training. Expression profiles constitute highly detailed +molecular descriptions of a tissue that we hypothesize offer a strong +task-agnostic training signal for learning slide embeddings. Our slide and +expression (S+E) pre-training strategy, called Tangle, employs +modality-specific encoders, the outputs of which are aligned via contrastive +learning. Tangle was pre-trained on samples from three different organs: liver +(n=6,597 S+E pairs), breast (n=1,020), and lung (n=1,012) from two different +species (Homo sapiens and Rattus norvegicus). Across three independent test +datasets consisting of 1,265 breast WSIs, 1,946 lung WSIs, and 4,584 liver +WSIs, Tangle shows significantly better few-shot performance compared to +supervised and SSL baselines. When assessed using prototype-based +classification and slide retrieval, Tangle also shows a substantial performance +improvement over all baselines. Code available at +https://github.com/mahmoodlab/TANGLE.",cs.CV,"['cs.CV', 'cs.AI']" +Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,Zhihao Cao · ZiDong Wang · Siwen Xie · Anji Liu · Lifeng Fan,https://github.com/bigai-ai/smart-help,https://arxiv.org/abs/2404.09001,,2404.09001.pdf,Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,"Despite the significant demand for assistive technology among vulnerable +groups (e.g., the elderly, children, and the disabled) in daily tasks, research +into advanced AI-driven assistive solutions that genuinely accommodate their +diverse needs remains sparse. Traditional human-machine interaction tasks often +require machines to simply help without nuanced consideration of human +abilities and feelings, such as their opportunity for practice and learning, +sense of self-improvement, and self-esteem. Addressing this gap, we define a +pivotal and novel challenge Smart Help, which aims to provide proactive yet +adaptive support to human agents with diverse disabilities and dynamic goals in +various tasks and environments. To establish this challenge, we leverage +AI2-THOR to build a new interactive 3D realistic household environment for the +Smart Help task. We introduce an innovative opponent modeling module that +provides a nuanced understanding of the main agent's capabilities and goals, in +order to optimize the assisting agent's helping policy. Rigorous experiments +validate the efficacy of our model components and show the superiority of our +holistic approach against established baselines. 
Our findings illustrate the +potential of AI-imbued assistive robots in improving the well-being of +vulnerable groups.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV']" +FocSAM: Delving Deeply into Focused Objects in Segmenting Anything,You Huang · Zongyu Lan · Liujuan Cao · Xianming Lin · Shengchuan Zhang · Guannan Jiang · Rongrong Ji, ,https://arxiv.org/abs/2405.18706,,2405.18706.pdf,FocSAM: Delving Deeply into Focused Objects in Segmenting Anything,"The Segment Anything Model (SAM) marks a notable milestone in segmentation +models, highlighted by its robust zero-shot capabilities and ability to handle +diverse prompts. SAM follows a pipeline that separates interactive segmentation +into image preprocessing through a large encoder and interactive inference via +a lightweight decoder, ensuring efficient real-time performance. However, SAM +faces stability issues in challenging samples upon this pipeline. These issues +arise from two main factors. Firstly, the image preprocessing disables SAM from +dynamically using image-level zoom-in strategies to refocus on the target +object during interaction. Secondly, the lightweight decoder struggles to +sufficiently integrate interactive information with image embeddings. To +address these two limitations, we propose FocSAM with a pipeline redesigned on +two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention +(Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. +Dwin-MSA localizes attention computations around the target object, enhancing +object-related embeddings with minimal computational overhead. Second, we +propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of +interactive information from a few initial clicks that have significant impacts +on the overall segmentation results. Experimentally, FocSAM augments SAM's +interactive segmentation performance to match the existing state-of-the-art +method in segmentation quality, requiring only about 5.6% of this method's +inference time on CPUs.",cs.CV,['cs.CV'] +Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation,Xiaoyang Wang · Huihui Bai · Limin Yu · Yao Zhao · Jimin Xiao, ,https://arxiv.org/abs/2403.06462v2,,2403.06462v2.pdf,Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation,"Semi-supervised semantic segmentation allows model to mine effective +supervision from unlabeled data to complement label-guided training. Recent +research has primarily focused on consistency regularization techniques, +exploring perturbation-invariant training at both the image and feature levels. +In this work, we proposed a novel feature-level consistency learning framework +named Density-Descending Feature Perturbation (DDFP). Inspired by the +low-density separation assumption in semi-supervised learning, our key insight +is that feature density can shed a light on the most promising direction for +the segmentation classifier to explore, which is the regions with lower +density. We propose to shift features with confident predictions towards +lower-density regions by perturbation injection. The perturbed features are +then supervised by the predictions on the original features, thereby compelling +the classifier to explore less dense regions to effectively regularize the +decision boundary. Central to our method is the estimation of feature density. 
+To this end, we introduce a lightweight density estimator based on normalizing +flow, allowing for efficient capture of the feature density distribution in an +online manner. By extracting gradients from the density estimator, we can +determine the direction towards less dense regions for each feature. The +proposed DDFP outperforms other designs on feature-level perturbations and +shows state of the art performances on both Pascal VOC and Cityscapes dataset +under various partition protocols. The project is available at +https://github.com/Gavinwxy/DDFP.",cs.CV,['cs.CV'] +Rethinking the Region Classification in Open-Vocabulary Semantic Segmentation: An Image-to-Image View,Yuan Wang · Rui Sun · Naisong Luo · Yuwen Pan · Tianzhu Zhang, ,https://arxiv.org/abs/2404.00262,,2404.00262.pdf,Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation,"Open-vocabulary semantic segmentation (OVS) aims to segment images of +arbitrary categories specified by class labels or captions. However, most +previous best-performing methods, whether pixel grouping methods or region +recognition methods, suffer from false matches between image features and +category labels. We attribute this to the natural gap between the textual +features and visual features. In this work, we rethink how to mitigate false +matches from the perspective of image-to-image matching and propose a novel +relation-aware intra-modal matching (RIM) framework for OVS based on visual +foundation models. RIM achieves robust region classification by firstly +constructing diverse image-modal reference features and then matching them with +region features based on relation-aware ranking distribution. The proposed RIM +enjoys several merits. First, the intra-modal reference features are better +aligned, circumventing potential ambiguities that may arise in cross-modal +matching. Second, the ranking-based matching process harnesses the structure +information implicit in the inter-class relationships, making it more robust +than comparing individually. Extensive experiments on three benchmarks +demonstrate that RIM outperforms previous state-of-the-art methods by large +margins, obtaining a lead of more than 10% in mIoU on PASCAL VOC benchmark.",cs.CV,['cs.CV'] +Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation,Lanyun Zhu · Tianrun Chen · Jianxiong Yin · Simon See · Jun Liu, ,https://arxiv.org/abs/2401.08407,,2401.08407.pdf,Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,"Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting +novel categories from a distinct domain using only limited exemplars. In this +paper, we undertake a comprehensive study of CD-FSS and uncover two crucial +insights: (i) the necessity of a fine-tuning stage to effectively transfer the +learned meta-knowledge across domains, and (ii) the overfitting risk during the +na\""ive fine-tuning due to the scarcity of novel category examples. With these +insights, we propose a novel cross-domain fine-tuning strategy that addresses +the challenging CD-FSS tasks. We first design Bi-directional Few-shot +Prediction (BFP), which establishes support-query correspondence in a +bi-directional manner, crafting augmented supervision to reduce the overfitting +risk. 
Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which +is a recursive framework to capture the support-query correspondence +iteratively, targeting maximal exploitation of supervisory signals from the +sparse novel category samples. Extensive empirical evaluations show that our +method significantly outperforms the state-of-the-arts (+7.8\%), which verifies +that IFA tackles the cross-domain challenges and mitigates the overfitting +simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.",cs.CV,['cs.CV'] +GeoChat: Grounded Large Vision-Language Model for Remote Sensing,Kartik Kuckreja · Muhammad Sohail Danish · Muzammal Naseer · Abhijit Das · Salman Khan · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2311.15826,,2311.15826.pdf,GeoChat: Grounded Large Vision-Language Model for Remote Sensing,"Recent advancements in Large Vision-Language Models (VLMs) have shown great +promise in natural image domains, allowing users to hold a dialogue about given +visual content. However, such general-domain VLMs perform poorly for Remote +Sensing (RS) scenarios, leading to inaccurate or fabricated information when +presented with RS domain-specific queries. Such a behavior emerges due to the +unique challenges introduced by RS imagery. For example, to handle +high-resolution RS imagery with diverse scale changes across categories and +many small objects, region-level reasoning is necessary alongside holistic +scene interpretation. Furthermore, the lack of domain-specific multimodal +instruction following data as well as strong backbone models for RS make it +hard for the models to align their behavior with user queries. To address these +limitations, we propose GeoChat - the first versatile remote sensing VLM that +offers multitask conversational capabilities with high-resolution RS images. +Specifically, GeoChat can not only answer image-level queries but also accepts +region inputs to hold region-specific dialogue. Furthermore, it can visually +ground objects in its responses by referring to their spatial coordinates. To +address the lack of domain-specific datasets, we generate a novel RS multimodal +instruction-following dataset by extending image-text pairs from existing +diverse RS datasets. We establish a comprehensive benchmark for RS multitask +conversations and compare with a number of baseline methods. GeoChat +demonstrates robust zero-shot performance on various RS tasks, e.g., image and +region captioning, visual question answering, scene classification, visually +grounded conversations and referring detection. Our code is available at +https://github.com/mbzuai-oryx/geochat.",cs.CV,"['cs.CV', 'cs.AI']" +RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation,Huayu Mai · Rui Sun · Tianzhu Zhang · Feng Wu, ,https://arxiv.org/abs/2312.08631,,2312.08631.pdf,Semi-supervised Semantic Segmentation Meets Masked Modeling:Fine-grained Locality Learning Matters in Consistency Regularization,"Semi-supervised semantic segmentation aims to utilize limited labeled images +and abundant unlabeled images to achieve label-efficient learning, wherein the +weak-to-strong consistency regularization framework, popularized by FixMatch, +is widely used as a benchmark scheme. Despite its effectiveness, we observe +that such scheme struggles with satisfactory segmentation for the local +regions. 
This can be because it originally stems from the image classification +task and lacks specialized mechanisms to capture fine-grained local semantics +that prioritizes in dense prediction. To address this issue, we propose a novel +framework called \texttt{MaskMatch}, which enables fine-grained locality +learning to achieve better dense segmentation. On top of the original +teacher-student framework, we design a masked modeling proxy task that +encourages the student model to predict the segmentation given the unmasked +image patches (even with 30\% only) and enforces the predictions to be +consistent with pseudo-labels generated by the teacher model using the complete +image. Such design is motivated by the intuition that if the predictions are +more consistent given insufficient neighboring information, stronger +fine-grained locality perception is achieved. Besides, recognizing the +importance of reliable pseudo-labels in the above locality learning and the +original consistency learning scheme, we design a multi-scale ensembling +strategy that considers context at different levels of abstraction for +pseudo-label generation. Extensive experiments on benchmark datasets +demonstrate the superiority of our method against previous approaches and its +plug-and-play flexibility.",cs.CV,['cs.CV'] +Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment,Alireza Ganjdanesh · Shangqian Gao · Heng Huang, ,https://arxiv.org/abs/2403.19490,,2403.19490.pdf,Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment,"Structural model pruning is a prominent approach used for reducing the +computational cost of Convolutional Neural Networks (CNNs) before their +deployment on resource-constrained devices. Yet, the majority of proposed ideas +require a pretrained model before pruning, which is costly to secure. In this +paper, we propose a novel structural pruning approach to jointly learn the +weights and structurally prune architectures of CNN models. The core element of +our method is a Reinforcement Learning (RL) agent whose actions determine the +pruning ratios of the CNN model's layers, and the resulting model's accuracy +serves as its reward. We conduct the joint training and pruning by iteratively +training the model's weights and the agent's policy, and we regularize the +model's weights to align with the selected structure by the agent. The evolving +model's weights result in a dynamic reward function for the agent, which +prevents using prominent episodic RL methods with stationary environment +assumption for our purpose. We address this challenge by designing a mechanism +to model the complex changing dynamics of the reward function and provide a +representation of it to the RL agent. To do so, we take a learnable embedding +for each training epoch and employ a recurrent model to calculate a +representation of the changing environment. We train the recurrent model and +embeddings using a decoder model to reconstruct observed rewards. Such a design +empowers our agent to effectively leverage episodic observations along with the +environment representations to learn a proper policy to determine performant +sub-networks of the CNN model. Our extensive experiments on CIFAR-10 and +ImageNet using ResNets and MobileNets demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion,Soyong Shin · Juyong Kim · Eni Halilaj · Michael J. 
Black,https://wham.is.tue.mpg.de/,https://arxiv.org/abs/2312.07531,,2312.07531.pdf,WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion,"The estimation of 3D human motion from video has progressed rapidly but +current methods still have several key limitations. First, most methods +estimate the human in camera coordinates. Second, prior work on estimating +humans in global coordinates often assumes a flat ground plane and produces +foot sliding. Third, the most accurate methods rely on computationally +expensive optimization pipelines, limiting their use to offline applications. +Finally, existing video-based methods are surprisingly less accurate than +single-frame methods. We address these limitations with WHAM (World-grounded +Humans with Accurate Motion), which accurately and efficiently reconstructs 3D +human motion in a global coordinate system from video. WHAM learns to lift 2D +keypoint sequences to 3D using motion capture data and fuses this with video +features, integrating motion context and visual information. WHAM exploits +camera angular velocity estimated from a SLAM method together with human motion +to estimate the body's global trajectory. We combine this with a contact-aware +trajectory refinement method that lets WHAM capture human motion in diverse +conditions, such as climbing stairs. WHAM outperforms all existing 3D human +motion recovery methods across multiple in-the-wild benchmarks. Code will be +available for research purposes at http://wham.is.tue.mpg.de/",cs.CV,['cs.CV'] +StrokeFaceNeRF: Stroke-based Facial Appearance Editing in Neural Radiance Field,Xiao-juan Li · Dingxi Zhang · Shu-Yu Chen · Feng-Lin Liu, ,https://arxiv.org/abs/2312.09913,,,LAENeRF: Local Appearance Editing for Neural Radiance Fields,"Due to the omnipresence of Neural Radiance Fields (NeRFs), the interest +towards editable implicit 3D representations has surged over the last years. +However, editing implicit or hybrid representations as used for NeRFs is +difficult due to the entanglement of appearance and geometry encoded in the +model parameters. Despite these challenges, recent research has shown first +promising steps towards photorealistic and non-photorealistic appearance edits. +The main open issues of related work include limited interactivity, a lack of +support for local edits and large memory requirements, rendering them less +useful in practice. We address these limitations with LAENeRF, a unified +framework for photorealistic and non-photorealistic appearance editing of +NeRFs. To tackle local editing, we leverage a voxel grid as starting point for +region selection. We learn a mapping from expected ray terminations to final +output color, which can optionally be supervised by a style loss, resulting in +a framework which can perform photorealistic and non-photorealistic appearance +editing of selected regions. Relying on a single point per ray for our mapping, +we limit memory requirements and enable fast optimization. To guarantee +interactivity, we compose the output color using a set of learned, modifiable +base colors, composed with additive layer mixing. Compared to concurrent work, +LAENeRF enables recoloring and stylization while keeping processing time low. 
+Furthermore, we demonstrate that our approach surpasses baseline methods both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D,Mukund Varma T · Peihao Wang · Zhiwen Fan · Zhangyang Wang · Hao Su · Ravi Ramamoorthi,https://mukundvarmat.github.io/Lift3D/,https://arxiv.org/abs/2403.18922,,2403.18922.pdf,Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D,"In recent years, there has been an explosion of 2D vision models for numerous +tasks such as semantic segmentation, style transfer or scene editing, enabled +by large-scale 2D image datasets. At the same time, there has been renewed +interest in 3D scene representations such as neural radiance fields from +multi-view images. However, the availability of 3D or multiview data is still +substantially limited compared to 2D image datasets, making extending 2D vision +models to 3D data highly desirable but also very challenging. Indeed, extending +a single 2D vision operator like scene editing to 3D typically requires a +highly creative method specialized to that task and often requires per-scene +optimization. In this paper, we ask the question of whether any 2D vision model +can be lifted to make 3D consistent predictions. We answer this question in the +affirmative; our new Lift3D method trains to predict unseen views on feature +spaces generated by a few visual models (i.e. DINO and CLIP), but then +generalizes to novel vision operators and tasks, such as style transfer, +super-resolution, open vocabulary segmentation and image colorization; for some +of these tasks, there is no comparable previous 3D method. In many cases, we +even outperform state-of-the-art methods specialized for the task in question. +Moreover, Lift3D is a zero-shot method, in the sense that it requires no +task-specific training, nor scene-specific optimization.",cs.CV,['cs.CV'] +Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model,Tian Liang · Jing Huang · Ming Kong · Luyuan Chen · Qiang Zhu, ,https://arxiv.org/html/2405.20654v1,,2405.20654v1.pdf,Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models,"Effective passage retrieval and reranking methods have been widely utilized +to identify suitable candidates in open-domain question answering tasks, recent +studies have resorted to LLMs for reranking the retrieved passages by the +log-likelihood of the question conditioned on each passage. Although these +methods have demonstrated promising results, the performance is notably +sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs +can be computationally intensive and time-consuming. Furthermore, this approach +limits the leverage of question-passage relevance pairs and passage-specific +knowledge to enhance the ranking capabilities of LLMs. In this paper, we +propose passage-specific prompt tuning for reranking in open-domain question +answering (PSPT): a parameter-efficient method that fine-tunes learnable +passage-specific soft prompts, incorporating passage-specific knowledge from a +limited set of question-passage relevance pairs. The method involves ranking +retrieved passages based on the log-likelihood of the model generating the +question conditioned on each passage and the learned soft prompt. 
We conducted +extensive experiments utilizing the Llama-2-chat-7B model across three publicly +available open-domain question answering datasets and the results demonstrate +the effectiveness of the proposed approach.",cs.CL,"['cs.CL', 'cs.IR']" +PairDETR : Joint Detection and Association of Human Bodies and Faces,Ammar Ali · Georgii Gaikov · Denis Rybalchenko · Alexander Chigorin · Ivan Laptev · Sergey Zagoruyko, ,https://arxiv.org/abs/2404.08450,,2404.08450.pdf,Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues,"Face recognition systems are frequently subjected to a variety of physical +and digital attacks of different types. Previous methods have achieved +satisfactory performance in scenarios that address physical attacks and digital +attacks, respectively. However, few methods are considered to integrate a model +that simultaneously addresses both physical and digital attacks, implying the +necessity to develop and maintain multiple models. To jointly detect physical +and digital attacks within a single model, we propose an innovative approach +that can adapt to any network architecture. Our approach mainly contains two +types of data augmentation, which we call Simulated Physical Spoofing Clues +augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). +SPSC and SDSC augment live samples into simulated attack samples by simulating +spoofing clues of physical and digital attacks, respectively, which +significantly improve the capability of the model to detect ""unseen"" attack +types. Extensive experiments show that SPSC and SDSC can achieve +state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData +dataset, respectively. Our method won first place in ""Unified Physical-Digital +Face Attack Detection"" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our +final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, +respectively. Our code is available at +https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.",cs.CV,['cs.CV'] +Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis,Zicheng Zhang · RUOBING ZHENG · Bonan Li · Congying Han · Tianqi Li · Meng Wang · Tiande Guo · Jingdong Chen · Ziwen Liu · Ming Yang, ,https://arxiv.org/abs/2402.17364v1,,2402.17364v1.pdf,Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis,"Recent works in implicit representations, such as Neural Radiance Fields +(NeRF), have advanced the generation of realistic and animatable head avatars +from video sequences. These implicit methods are still confronted by visual +artifacts and jitters, since the lack of explicit geometric constraints poses a +fundamental challenge in accurately modeling complex facial deformations. In +this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid +representation that encodes explicit dynamic meshes by neural networks to +ensure geometric consistency across various motions and viewpoints. DynTet is +parameterized by the coordinate-based networks which learn signed distance, +deformation, and material texture, anchoring the training data into a +predefined tetrahedra grid. Leveraging Marching Tetrahedra, DynTet efficiently +decodes textured meshes with a consistent topology, enabling fast rendering +through a differentiable rasterizer and supervision via a pixel loss. To +enhance training efficiency, we incorporate classical 3D Morphable Models to +facilitate geometry learning and define a canonical space for simplifying +texture learning. 
These advantages are readily achievable owing to the +effective geometric representation employed in DynTet. Compared with prior +works, DynTet demonstrates significant improvements in fidelity, lip +synchronization, and real-time performance according to various metrics. Beyond +producing stable and visually appealing synthesis videos, our method also +outputs the dynamic meshes which is promising to enable many emerging +applications.",cs.CV,['cs.CV'] +SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering,Antoine Guédon · Vincent Lepetit,https://anttwo.github.io/sugar/,,https://huggingface.co/papers/2311.12775,,,,,nan +Composed Video Retrieval via Enriched Context and Discriminative Embeddings,Omkar Thawakar · Muzammal Naseer · Rao Anwer · Salman Khan · Michael Felsberg · Mubarak Shah · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2403.16997,,2403.16997.pdf,Composed Video Retrieval via Enriched Context and Discriminative Embeddings,"Composed video retrieval (CoVR) is a challenging problem in computer vision +which has recently highlighted the integration of modification text with visual +queries for more sophisticated video search in large databases. Existing works +predominantly rely on visual queries combined with modification text to +distinguish relevant videos. However, such a strategy struggles to fully +preserve the rich query-specific context in retrieved target videos and only +represents the target video using visual embedding. We introduce a novel CoVR +framework that leverages detailed language descriptions to explicitly encode +query-specific contextual information and learns discriminative embeddings of +vision only, text only and vision-text for better alignment to accurately +retrieve matched target videos. Our proposed framework can be flexibly employed +for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on +three datasets show that our approach obtains state-of-the-art performance for +both CovR and zero-shot CoIR tasks, achieving gains as high as around 7% in +terms of recall@K=1 score. Our code, models, detailed language descriptions for +WebViD-CoVR dataset are available at +\url{https://github.com/OmkarThawakar/composed-video-retrieval}",cs.CV,['cs.CV'] +Distilling Vision-Language Models on Millions of Videos,Yue Zhao · Long Zhao · Xingyi Zhou · Jialin Wu · Chun-Te Chu · Hui Miao · Florian Schroff · Hartwig Adam · Ting Liu · Boqing Gong · Philipp Krähenbühl · Liangzhe Yuan, ,https://arxiv.org/abs/2401.06129,,2401.06129.pdf,Distilling Vision-Language Models on Millions of Videos,"The recent advance in vision-language models is largely attributed to the +abundance of image-text data. We aim to replicate this success for +video-language models, but there simply is not enough human-curated video-text +data available. We thus resort to fine-tuning a video-language model from a +strong image-language baseline with synthesized instructional data. The +resulting video model by video-instruction-tuning (VIIT) is then used to +auto-label millions of videos to generate high-quality captions. We show the +adapted video-language model performs well on a wide range of video-language +benchmarks. For instance, it surpasses the best prior result on open-ended +NExT-QA by 2.8%. Besides, our model generates detailed descriptions for +previously unseen videos, which provide better textual supervision than +existing methods. 
Experiments show that a video-language dual-encoder model +contrastively trained on these auto-generated captions is 3.8% better than the +strongest baseline that also leverages vision-language models. Our best model +outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video +retrieval by 6%. As a side product, we generate the largest video caption +dataset to date.",cs.CV,['cs.CV'] +ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing,Junkun Chen · Samuel Rota Bulò · Norman Müller · Lorenzo Porzi · Peter Kontschieder · Yu-Xiong Wang, ,https://arxiv.org/abs/2308.13223,,,EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior,"While image diffusion models have made significant progress in text-driven 3D +content creation, they often fail to accurately capture the intended meaning of +text prompts, especially for view information. This limitation leads to the +Janus problem, where multi-faced 3D models are generated under the guidance of +such diffusion models. In this paper, we propose a robust high-quality 3D +content generation pipeline by exploiting orthogonal-view image guidance. +First, we introduce a novel 2D diffusion model that generates an image +consisting of four orthogonal-view sub-images based on the given text prompt. +Then, the 3D content is created using this diffusion model. Notably, the +generated orthogonal-view image provides strong geometric structure priors and +thus improves 3D consistency. As a result, it effectively resolves the Janus +problem and significantly enhances the quality of 3D content creation. +Additionally, we present a 3D synthesis fusion network that can further improve +the details of the generated 3D contents. Both quantitative and qualitative +evaluations demonstrate that our method surpasses previous text-to-3D +techniques. Project page: https://efficientdreamer.github.io.",cs.CV,['cs.CV'] +Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness,Guangzhi Wang · Yangyang Guo · Ziwei Xu · Mohan Kankanhalli, ,https://arxiv.org/abs/2307.10499,,,Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection,"Human-Object Interaction Detection is a crucial aspect of human-centric scene +understanding, with important applications in various domains. Despite recent +progress in this field, recognizing subtle and detailed interactions remains +challenging. Existing methods try to use human-related clues to alleviate the +difficulty, but rely heavily on external annotations or knowledge, limiting +their practical applicability in real-world scenarios. In this work, we propose +a novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a +Conditional Part Attention (CPA) mechanism, where human features are taken as +keys and values, and the object feature is used as query for the computation in +a cross-attention mechanism. In this way, our model learns to automatically +focus on the most informative human parts conditioned on the involved object, +generating more semantically meaningful features for interaction recognition. +Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to +facilitate interaction recognition under occluded scenarios, which teaches the +model to extrapolate detailed features from partially occluded ones. Our method +consistently outperforms prior approaches on the V-COCO and HICO-DET datasets, +without external data or extra annotations. 
Additional ablation studies +validate the effectiveness of each component of our proposed method.",cs.CV,['cs.CV'] +Multi-modal learning for geospatial vegetation forecasting,Vitus Benson · Claire Robin · Christian Requena-Mesa · LAZARO ALONSO SILVA · Mélanie Weynants · Nora Linscheid · Jose Cortes · Zhihan Gao · Nuno Carvalhais · Markus Reichstein, ,https://arxiv.org/html/2405.20161v1,,2405.20161v1.pdf,Landslide mapping from Sentinel-2 imagery through change detection,"Landslides are one of the most critical and destructive geohazards. +Widespread development of human activities and settlements combined with the +effects of climate change on weather are resulting in a high increase in the +frequency and destructive power of landslides, making them a major threat to +human life and the economy. In this paper, we explore methodologies to map +newly-occurred landslides using Sentinel-2 imagery automatically. All +approaches presented are framed as a bi-temporal change detection problem, +requiring only a pair of Sentinel-2 images, taken respectively before and after +a landslide-triggering event. Furthermore, we introduce a novel deep learning +architecture for fusing Sentinel-2 bi-temporal image pairs with Digital +Elevation Model (DEM) data, showcasing its promising performances w.r.t. other +change detection models in the literature. As a parallel task, we address +limitations in existing datasets by creating a novel geodatabase, which +includes manually validated open-access landslide inventories over +heterogeneous ecoregions of the world. We release both code and dataset with an +open-source license.",cs.CV,"['cs.CV', 'eess.IV']" +LISA: Reasoning Segmentation via Large Language Model,Xin Lai · Zhuotao Tian · Yukang Chen · Yanwei Li · Yuhui Yuan · Shu Liu · Jiaya Jia, ,https://arxiv.org/abs/2308.00692,,2308.00692.pdf,LISA: Reasoning Segmentation via Large Language Model,"Although perception systems have made remarkable advancements in recent +years, they still rely on explicit human instruction or pre-defined categories +to identify the target objects before executing visual recognition tasks. Such +systems cannot actively reason and comprehend implicit user intention. In this +work, we propose a new segmentation task -- reasoning segmentation. The task is +designed to output a segmentation mask given a complex and implicit query text. +Furthermore, we establish a benchmark comprising over one thousand +image-instruction-mask data samples, incorporating intricate reasoning and +world knowledge for evaluation purposes. Finally, we present LISA: large +Language Instructed Segmentation Assistant, which inherits the language +generation capabilities of multimodal Large Language Models (LLMs) while also +possessing the ability to produce segmentation masks. We expand the original +vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to +unlock the segmentation capability. Remarkably, LISA can handle cases involving +complex reasoning and world knowledge. Also, it demonstrates robust zero-shot +capability when trained exclusively on reasoning-free datasets. In addition, +fine-tuning the model with merely 239 reasoning segmentation data samples +results in further performance enhancement. Both quantitative and qualitative +experiments show our method effectively unlocks new reasoning segmentation +capabilities for multimodal LLMs. 
Code, models, and data are available at +https://github.com/dvlab-research/LISA.",cs.CV,['cs.CV'] +Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion,Linzhan Mou · Junkun Chen · Yu-Xiong Wang, ,https://arxiv.org/abs/2306.09551,,2306.09551.pdf,Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model,"Recent research has demonstrated that the combination of pretrained diffusion +models with neural radiance fields (NeRFs) has emerged as a promising approach +for text-to-3D generation. Simply coupling NeRF with diffusion models will +result in cross-view inconsistency and degradation of stylized view syntheses. +To address this challenge, we propose the Edit-DiffNeRF framework, which is +composed of a frozen diffusion model, a proposed delta module to edit the +latent semantic space of the diffusion model, and a NeRF. Instead of training +the entire diffusion for each scene, our method focuses on editing the latent +semantic space in frozen pretrained diffusion models by the delta module. This +fundamental change to the standard diffusion framework enables us to make +fine-grained modifications to the rendered views and effectively consolidate +these instructions in a 3D scene via NeRF training. As a result, we are able to +produce an edited 3D scene that faithfully aligns to input text instructions. +Furthermore, to ensure semantic consistency across different viewpoints, we +propose a novel multi-view semantic consistency loss that extracts a latent +semantic embedding from the input view as a prior, and aim to reconstruct it in +different views. Our proposed method has been shown to effectively edit +real-world 3D scenes, resulting in 25% improvement in the alignment of the +performed 3D edits with text instructions compared to prior work.",cs.CV,['cs.CV'] +RepAn: Enhanced Annealing through Re-parameterization,Xiang Fei · Xiawu Zheng · Yan Wang · Fei Chao · Chenglin Wu · Liujuan Cao, ,,https://dilithjay.com/blog/the-reparameterization-trick-clearly-explained,,,,,nan +EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything,Yunyang Xiong · Balakrishnan Varadarajan · Lemeng Wu · Xiaoyu Xiang · Fanyi Xiao · Chenchen Zhu · Xiaoliang Dai · Dilin Wang · Fei Sun · Forrest Iandola · Raghuraman Krishnamoorthi · Vikas Chandra, ,https://arxiv.org/abs/2312.00863,,2312.00863.pdf,EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything,"Segment Anything Model (SAM) has emerged as a powerful tool for numerous +vision applications. A key component that drives the impressive performance for +zero-shot transfer and high versatility is a super large Transformer model +trained on the extensive high-quality SA-1B dataset. While beneficial, the huge +computation cost of SAM model has limited its applications to wider real-world +applications. To address this limitation, we propose EfficientSAMs, +light-weight SAM models that exhibits decent performance with largely reduced +complexity. Our idea is based on leveraging masked image pretraining, SAMI, +which learns to reconstruct features from SAM image encoder for effective +visual representation learning. Further, we take SAMI-pretrained light-weight +image encoders and mask decoder to build EfficientSAMs, and finetune the models +on SA-1B for segment anything task. 
We perform evaluations on multiple vision +tasks including image classification, object detection, instance segmentation, +and semantic object detection, and find that our proposed pretraining method, +SAMI, consistently outperforms other masked image pretraining methods. On +segment anything task such as zero-shot instance segmentation, our +EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably +with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.",cs.CV,['cs.CV'] +Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning,Zhengwei Fang · Rui Wang · Tao Huang · Liping Jing, ,https://arxiv.org/abs/2308.02897,,2308.02897.pdf,An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability,"While the transferability property of adversarial examples allows the +adversary to perform black-box attacks (i.e., the attacker has no knowledge +about the target model), the transfer-based adversarial attacks have gained +great attention. Previous works mostly study gradient variation or image +transformations to amplify the distortion on critical parts of inputs. These +methods can work on transferring across models with limited differences, i.e., +from CNNs to CNNs, but always fail in transferring across models with wide +differences, such as from CNNs to ViTs. Alternatively, model ensemble +adversarial attacks are proposed to fuse outputs from surrogate models with +diverse architectures to get an ensemble loss, making the generated adversarial +example more likely to transfer to other models as it can fool multiple models +concurrently. However, existing ensemble attacks simply fuse the outputs of the +surrogate models evenly, thus are not efficacious to capture and amplify the +intrinsic transfer information of adversarial examples. In this paper, we +propose an adaptive ensemble attack, dubbed AdaEA, to adaptively control the +fusion of the outputs from each model, via monitoring the discrepancy ratio of +their contributions towards the adversarial objective. Furthermore, an extra +disparity-reduced filter is introduced to further synchronize the update +direction. As a result, we achieve considerable improvement over the existing +ensemble attacks on various datasets, and the proposed AdaEA can also boost +existing transfer-based attacks, which further demonstrates its efficacy and +versatility.",cs.CV,['cs.CV'] +AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning,Duojun Huang · Xinyu Xiong · Jie Ma · Jichang Li · Zequn Jie · Lin Ma · Guanbin Li, ,https://arxiv.org/abs/2312.03628,,2312.03628.pdf,Boosting Segment Anything Model Towards Open-Vocabulary Learning,"The recent Segment Anything Model (SAM) has emerged as a new paradigmatic +vision foundation model, showcasing potent zero-shot generalization and +flexible prompting. Despite SAM finding applications and adaptations in various +domains, its primary limitation lies in the inability to grasp object +semantics. In this paper, we present Sambor to seamlessly integrate SAM with +the open-vocabulary object detector in an end-to-end framework. While retaining +all the remarkable capabilities inherent to SAM, we enhance it with the +capacity to detect arbitrary objects based on human inputs like category names +or reference expressions. 
To accomplish this, we introduce a novel SideFormer +module that extracts SAM features to facilitate zero-shot object localization +and inject comprehensive semantic information for open-vocabulary recognition. +In addition, we devise an open-set region proposal network (Open-set RPN), +enabling the detector to acquire the open-set proposals generated by SAM. +Sambor demonstrates superior zero-shot performance across benchmarks, including +COCO and LVIS, proving highly competitive against previous SoTA methods. We +aspire for this work to serve as a meaningful endeavor in endowing SAM to +recognize diverse object categories and advancing open-vocabulary learning with +the support of vision foundation models.",cs.CV,['cs.CV'] +Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering,Zaid Khan · Yun Fu, ,https://arxiv.org/abs/2404.10193,,2404.10193.pdf,Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering,"The goal of selective prediction is to allow an a model to abstain when it +may not be able to deliver a reliable prediction, which is important in +safety-critical contexts. Existing approaches to selective prediction typically +require access to the internals of a model, require retraining a model or study +only unimodal models. However, the most powerful models (e.g. GPT-4) are +typically only available as black boxes with inaccessible internals, are not +retrainable by end-users, and are frequently used for multimodal tasks. We +study the possibility of selective prediction for vision-language models in a +realistic, black-box setting. We propose using the principle of +\textit{neighborhood consistency} to identify unreliable responses from a +black-box vision-language model in question answering tasks. We hypothesize +that given only a visual question and model response, the consistency of the +model's responses over the neighborhood of a visual question will indicate +reliability. It is impossible to directly sample neighbors in feature space in +a black-box setting. Instead, we show that it is possible to use a smaller +proxy model to approximately sample from the neighborhood. We find that +neighborhood consistency can be used to identify model responses to visual +questions that are likely unreliable, even in adversarial settings or settings +that are out-of-distribution to the proxy model.",cs.CV,['cs.CV'] +Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification,Kunlun Xu · Xu Zou · Yuxin Peng · Jiahuan Zhou, ,https://arxiv.org/abs/2405.19005,,2405.19005.pdf,Auto-selected Knowledge Adapters for Lifelong Person Re-identification,"Lifelong Person Re-Identification (LReID) extends traditional ReID by +requiring systems to continually learn from non-overlapping datasets across +different times and locations, adapting to new identities while preserving +knowledge of previous ones. Existing approaches, either rehearsal-free or +rehearsal-based, still suffer from the problem of catastrophic forgetting since +they try to cram diverse knowledge into one fixed model. To overcome this +limitation, we introduce a novel framework AdalReID, that adopts knowledge +adapters and a parameter-free auto-selection mechanism for lifelong learning. 
+Concretely, we incrementally build distinct adapters to learn domain-specific +knowledge at each step, which can effectively learn and preserve knowledge +across different datasets. Meanwhile, the proposed auto-selection strategy +adaptively calculates the knowledge similarity between the input set and the +adapters. On the one hand, the appropriate adapters are selected for the inputs +to process ReID, and on the other hand, the knowledge interaction and fusion +between adapters are enhanced to improve the generalization ability of the +model. Extensive experiments are conducted to demonstrate the superiority of +our AdalReID, which significantly outperforms SOTAs by about 10$\sim$20\% mAP +on both seen and unseen domains.",cs.CV,['cs.CV'] +Looking 3D: Anomaly Detection with 2D-3D Alignment,Ankan Kumar Bhunia · Changjian Li · Hakan Bilen,https://github.com/VICO-UoE/Looking3D,https://arxiv.org/abs/2311.14897,,2311.14897.pdf,Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,"Recently, 3D anomaly detection, a crucial problem involving fine-grained +geometry discrimination, is getting more attention. However, the lack of +abundant real 3D anomaly data limits the scalability of current models. To +enable scalable anomaly data collection, we propose a 3D anomaly synthesis +pipeline to adapt existing large-scale 3Dmodels for 3D anomaly detection. +Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, basedon +ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 +categories, which provides a rich and varied collection of data, enabling +efficient training and enhancing adaptability to industrial scenarios. +Meanwhile,to enable scalable representation learning for 3D anomaly +localization, we propose a self-supervised method, i.e., Iterative Mask +Reconstruction Network (IMRNet). During training, we propose a geometry-aware +sample module to preserve potentially anomalous local regions during point +cloud down-sampling. Then, we randomly mask out point patches and sent the +visible patches to a transformer for reconstruction-based self-supervision. +During testing, the point cloud repeatedly goes through the Mask Reconstruction +Network, with each iteration's output becoming the next input. By merging and +contrasting the final reconstructed point cloud with the initial input, our +method successfully locates anomalies. Experiments show that IMRNet outperforms +previous state-of-the-art methods, achieving 66.1% in I-AUC on Anomaly-ShapeNet +dataset and 72.5% in I-AUC on Real3D-AD dataset. Our dataset will be released +at https://github.com/Chopper-233/Anomaly-ShapeNet",cs.CV,['cs.CV'] +Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,wenqiao Li · Xiaohao Xu · Yao Gu · BoZhong Zheng · Shenghua Gao · Yingna Wu, ,https://arxiv.org/abs/2311.14897,,,Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,"Recently, 3D anomaly detection, a crucial problem involving fine-grained +geometry discrimination, is getting more attention. However, the lack of +abundant real 3D anomaly data limits the scalability of current models. To +enable scalable anomaly data collection, we propose a 3D anomaly synthesis +pipeline to adapt existing large-scale 3Dmodels for 3D anomaly detection. 
+Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, basedon +ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 +categories, which provides a rich and varied collection of data, enabling +efficient training and enhancing adaptability to industrial scenarios. +Meanwhile,to enable scalable representation learning for 3D anomaly +localization, we propose a self-supervised method, i.e., Iterative Mask +Reconstruction Network (IMRNet). During training, we propose a geometry-aware +sample module to preserve potentially anomalous local regions during point +cloud down-sampling. Then, we randomly mask out point patches and sent the +visible patches to a transformer for reconstruction-based self-supervision. +During testing, the point cloud repeatedly goes through the Mask Reconstruction +Network, with each iteration's output becoming the next input. By merging and +contrasting the final reconstructed point cloud with the initial input, our +method successfully locates anomalies. Experiments show that IMRNet outperforms +previous state-of-the-art methods, achieving 66.1% in I-AUC on Anomaly-ShapeNet +dataset and 72.5% in I-AUC on Real3D-AD dataset. Our dataset will be released +at https://github.com/Chopper-233/Anomaly-ShapeNet",cs.CV,['cs.CV'] +Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data,Lihe Yang · Bingyi Kang · Zilong Huang · Xiaogang Xu · Jiashi Feng · Hengshuang Zhao, ,https://arxiv.org/abs/2401.10891,,2401.10891.pdf,Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data,"This work presents Depth Anything, a highly practical solution for robust +monocular depth estimation. Without pursuing novel technical modules, we aim to +build a simple yet powerful foundation model dealing with any images under any +circumstances. To this end, we scale up the dataset by designing a data engine +to collect and automatically annotate large-scale unlabeled data (~62M), which +significantly enlarges the data coverage and thus is able to reduce the +generalization error. We investigate two simple yet effective strategies that +make data scaling-up promising. First, a more challenging optimization target +is created by leveraging data augmentation tools. It compels the model to +actively seek extra visual knowledge and acquire robust representations. +Second, an auxiliary supervision is developed to enforce the model to inherit +rich semantic priors from pre-trained encoders. We evaluate its zero-shot +capabilities extensively, including six public datasets and randomly captured +photos. It demonstrates impressive generalization ability. Further, through +fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs +are set. Our better depth model also results in a better depth-conditioned +ControlNet. 
Our models are released at +https://github.com/LiheYoung/Depth-Anything.",cs.CV,['cs.CV'] +SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World,Kiana Ehsani · Tanmay Gupta · Rose Hendrix · Jordi Salvador · Luca Weihs · Kuo-Hao Zeng · Kunal Singh Singh · Yejin Kim · Winson Han · Alvaro Herrasti · Ranjay Krishna · Dustin Schwenk · Eli VanderBilt · Aniruddha Kembhavi, ,https://arxiv.org/abs/2312.02976,,2312.02976.pdf,Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World,"Reinforcement learning (RL) with dense rewards and imitation learning (IL) +with human-generated trajectories are the most widely used approaches for +training modern embodied agents. RL requires extensive reward shaping and +auxiliary losses and is often too slow and ineffective for long-horizon tasks. +While IL with human supervision is effective, collecting human trajectories at +scale is extremely expensive. In this work, we show that imitating +shortest-path planners in simulation produces agents that, given a language +instruction, can proficiently navigate, explore, and manipulate objects in both +simulation and in the real world using only RGB sensors (no depth map or GPS +coordinates). This surprising result is enabled by our end-to-end, +transformer-based, SPOC architecture, powerful visual encoders paired with +extensive image augmentation, and the dramatic scale and diversity of our +training data: millions of frames of shortest-path-expert trajectories +collected inside approximately 200,000 procedurally generated houses containing +40,000 unique 3D assets. Our models, data, training code, and newly proposed +10-task benchmarking suite CHORES will be open-sourced.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV']" +A Unified and Interpretable Emotion Representation and Expression Generation,Reni Paskaleva · Mykyta Holubakha · Andela Ilic · Saman Motamed · Luc Van Gool · Danda Paudel,https://emotion-diffusion.github.io/,https://arxiv.org/abs/2404.01243,,2404.01243.pdf,A Unified and Interpretable Emotion Representation and Expression Generation,"Canonical emotions, such as happy, sad, and fearful, are easy to understand +and annotate. However, emotions are often compound, e.g. happily surprised, and +can be mapped to the action units (AUs) used for expressing emotions, and +trivially to the canonical ones. Intuitively, emotions are continuous as +represented by the arousal-valence (AV) model. An interpretable unification of +these four modalities - namely, Canonical, Compound, AUs, and AV - is highly +desirable, for a better representation and understanding of emotions. However, +such unification remains to be unknown in the current literature. In this work, +we propose an interpretable and unified emotion model, referred as C2A2. We +also develop a method that leverages labels of the non-unified models to +annotate the novel unified one. Finally, we modify the text-conditional +diffusion models to understand continuous numbers, which are then used to +generate continuous expressions using our unified emotion model. Through +quantitative and qualitative experiments, we show that our generated images are +rich and capture subtle expressions. 
Our work allows a fine-grained generation +of expressions in conjunction with other textual inputs and offers a new label +space for emotions at the same time.",cs.CV,['cs.CV'] +Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning,Pehuen Moure · Longbiao Cheng · Joachim Ott · Zuowen Wang · Shih-Chii Liu, ,,https://arxiv.org/pdf/2207.02016v4,,,,,nan +Understanding Video Transformers via Universal Concept Discovery,Matthew Kowal · Achal Dave · Rares Andrei Ambrus · Adrien Gaidon · Kosta Derpanis · Pavel Tokmakov,https://yorkucvil.github.io/VTCD/,https://arxiv.org/abs/2401.10831,,,Understanding Video Transformers via Universal Concept Discovery,"This paper studies the problem of concept-based interpretability of +transformer representations for videos. Concretely, we seek to explain the +decision-making process of video transformers based on high-level, +spatiotemporal concepts that are automatically discovered. Prior research on +concept-based interpretability has concentrated solely on image-level tasks. +Comparatively, video models deal with the added temporal dimension, increasing +complexity and posing challenges in identifying dynamic concepts over time. In +this work, we systematically address these challenges by introducing the first +Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose +an efficient approach for unsupervised identification of units of video +transformer representations - concepts, and ranking their importance to the +output of a model. The resulting concepts are highly interpretable, revealing +spatio-temporal reasoning mechanisms and object-centric representations in +unstructured video models. Performing this analysis jointly over a diverse set +of supervised and self-supervised representations, we discover that some of +these mechanism are universal in video transformers. Finally, we show that VTCD +can be used for fine-grained action recognition and video object segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +HIPTrack: Visual Tracking with Historical Prompts,Wenrui Cai · Qingjie Liu · Yunhong Wang, ,https://arxiv.org/abs/2311.02072,,2311.02072.pdf,HIPTrack: Visual Tracking with Historical Prompts,"Trackers that follow Siamese paradigm utilize similarity matching between +template and search region features for tracking. Many methods have been +explored to enhance tracking performance by incorporating tracking history to +better handle scenarios involving target appearance variations such as +deformation and occlusion. However, the utilization of historical information +in existing methods is insufficient and incomprehensive, which typically +requires repetitive training and introduces a large amount of computation. In +this paper, we show that by providing a tracker that follows Siamese paradigm +with precise and updated historical information, a significant performance +improvement can be achieved with completely unchanged parameters. Based on +this, we propose a historical prompt network that uses refined historical +foreground masks and historical visual features of the target to provide +comprehensive and precise prompts for the tracker. We build a novel tracker +called HIPTrack based on the historical prompt network, which achieves +considerable performance improvements without the need to retrain the entire +model. 
We conduct experiments on seven datasets and experimental results +demonstrate that our method surpasses the current state-of-the-art trackers on +LaSOT, LaSOText, GOT-10k and NfS. Furthermore, the historical prompt network +can seamlessly integrate as a plug-and-play module into existing trackers, +providing performance enhancements. The source code is available at +https://github.com/WenRuiCai/HIPTrack.",cs.CV,['cs.CV'] +Self-supervised Representation Learning from Arbitrary Scenarios,Zhaowen Li · Yousong Zhu · Zhiyang Chen · Zongxin Gao · Rui Zhao · Chaoyang Zhao · Ming Tang · Jinqiao Wang, ,https://arxiv.org/abs/2403.03740,,2403.03740.pdf,Self-supervised Photographic Image Layout Representation Learning,"In the domain of image layout representation learning, the critical process +of translating image layouts into succinct vector forms is increasingly +significant across diverse applications, such as image retrieval, manipulation, +and generation. Most approaches in this area heavily rely on costly labeled +datasets and notably lack in adapting their modeling and learning methods to +the specific nuances of photographic image layouts. This shortfall makes the +learning process for photographic image layouts suboptimal. In our research, we +directly address these challenges. We innovate by defining basic layout +primitives that encapsulate various levels of layout information and by mapping +these, along with their interconnections, onto a heterogeneous graph structure. +This graph is meticulously engineered to capture the intricate layout +information within the pixel domain explicitly. Advancing further, we introduce +novel pretext tasks coupled with customized loss functions, strategically +designed for effective self-supervised learning of these layout graphs. +Building on this foundation, we develop an autoencoder-based network +architecture skilled in compressing these heterogeneous layout graphs into +precise, dimensionally-reduced layout representations. Additionally, we +introduce the LODB dataset, which features a broader range of layout categories +and richer semantics, serving as a comprehensive benchmark for evaluating the +effectiveness of layout representation learning methods. Our extensive +experimentation on this dataset demonstrates the superior performance of our +approach in the realm of photographic image layout representation learning.",cs.CV,"['cs.CV', 'cs.MM']" +Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation,Yunhao Ge · Xiaohui Zeng · Jacob Huffman · Tsung-Yi Lin · Ming-Yu Liu · Yin Cui, ,https://arxiv.org/abs/2404.19752,,2404.19752.pdf,Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation,"Existing automatic captioning methods for visual content face challenges such +as lack of detail, content hallucination, and poor instruction following. In +this work, we propose VisualFactChecker (VFC), a flexible training-free +pipeline that generates high-fidelity and detailed captions for both 2D images +and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text +captioning models propose multiple initial captions; 2) verification, where a +large language model (LLM) utilizes tools such as object detection and VQA +models to fact-check proposed captions; 3) captioning, where an LLM generates +the final caption by summarizing caption proposals and the fact check +verification results. In this step, VFC can flexibly generate captions in +various styles following complex instructions. 
We conduct comprehensive +captioning evaluations using four metrics: 1) CLIP-Score for image-text +similarity; 2) CLIP-Image-Score for measuring the image-image similarity +between the original and the reconstructed image generated by a text-to-image +model using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V +for fine-grained evaluation. Evaluation results show that VFC outperforms +state-of-the-art open-sourced captioning methods for 2D images on the COCO +dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by +combining open-source models into a pipeline, we can attain captioning +capability comparable to proprietary models such as GPT-4V, despite being over +10x smaller in model size.",cs.CV,['cs.CV'] +NViST: In the Wild New View Synthesis from a Single Image with Transformers,Wonbong Jang · Lourdes Agapito, ,https://arxiv.org/abs/2312.08568,,2312.08568.pdf,NViST: In the Wild New View Synthesis from a Single Image with Transformers,"We propose NViST, a transformer-based model for efficient and generalizable +novel-view synthesis from a single image for real-world scenes. In contrast to +many methods that are trained on synthetic data, object-centred scenarios, or +in a category-specific manner, NViST is trained on MVImgNet, a large-scale +dataset of casually-captured real-world videos of hundreds of object categories +with diverse backgrounds. NViST transforms image inputs directly into a +radiance field, conditioned on camera parameters via adaptive layer +normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE) +features and translates them to 3D output tokens via cross-attention, while +addressing occlusions with self-attention. To move away from object-centred +datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose +model and only requires relative pose, dropping the need for canonicalization +of the training data, which removes a substantial barrier to it being used on +casually captured datasets. We show results on unseen objects and categories +from MVImgNet and even generalization to casual phone captures. We conduct +qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that +our model represents a step forward towards enabling true in-the-wild +generalizable novel-view synthesis from a single image. Project webpage: +https://wbjang.github.io/nvist_webpage.",cs.CV,['cs.CV'] +Space-time Diffusion Features for Zero-shot Text-driven Motion Transfer,Rafail Fridman · Danah Yatim · Omer Bar-Tal · Yoni Kasten · Tali Dekel,https://diffusion-motion-transfer.github.io/,https://arxiv.org/abs/2311.17009,,2311.17009.pdf,Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer,"We present a new method for text-driven motion transfer - synthesizing a +video that complies with an input text prompt describing the target objects and +scene while maintaining an input video's motion and scene layout. Prior methods +are confined to transferring motion across two subjects within the same or +closely related object categories and are applicable for limited domains (e.g., +humans). In this work, we consider a significantly more challenging setting in +which the target and source objects differ drastically in shape and +fine-grained motion characteristics (e.g., translating a jumping dog into a +dolphin). To this end, we leverage a pre-trained and fixed text-to-video +diffusion model, which provides us with generative and motion priors. 
The +pillar of our method is a new space-time feature loss derived directly from the +model. This loss guides the generation process to preserve the overall motion +of the input video while complying with the target object in terms of shape and +fine-grained motion traits.",cs.CV,['cs.CV'] +COLMAP-Free 3D Gaussian Splatting,Yang Fu · Sifei Liu · Amey Kulkarni · Jan Kautz · Alexei A. Efros · Xiaolong Wang, ,https://arxiv.org/abs/2312.07504,,2312.07504.pdf,COLMAP-Free 3D Gaussian Splatting,"While neural rendering has led to impressive advances in scene reconstruction +and novel view synthesis, it relies heavily on accurately pre-computed camera +poses. To relax this constraint, multiple efforts have been made to train +Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the +implicit representations of NeRFs provide extra challenges to optimize the 3D +structure and camera poses at the same time. On the other hand, the recently +proposed 3D Gaussian Splatting provides new opportunities given its explicit +point cloud representations. This paper leverages both the explicit geometric +representation and the continuity of the input video stream to perform novel +view synthesis without any SfM preprocessing. We process the input frames in a +sequential manner and progressively grow the 3D Gaussians set by taking one +input frame at a time, without the need to pre-compute the camera poses. Our +method significantly improves over previous approaches in view synthesis and +camera pose estimation under large motion changes. Our project page is +https://oasisyang.github.io/colmap-free-3dgs",cs.CV,['cs.CV'] +Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic,Sachin Goyal · Pratyush Maini · Zachary Lipton · Aditi Raghunathan · Zico Kolter, ,https://arxiv.org/abs/2404.07177v1,,2404.07177v1.pdf,Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic,"Vision-language models (VLMs) are trained for thousands of GPU hours on +carefully curated web datasets. In recent times, data curation has gained +prominence with several works developing strategies to retain 'high-quality' +subsets of 'raw' scraped data. For instance, the LAION public dataset retained +only 10% of the total crawled data. However, these strategies are typically +developed agnostic of the available compute for training. In this paper, we +first demonstrate that making filtering decisions independent of training +compute is often suboptimal: the limited high-quality data rapidly loses its +utility when repeated, eventually requiring the inclusion of 'unseen' but +'lower-quality' data. To address this quality-quantity tradeoff +($\texttt{QQT}$), we introduce neural scaling laws that account for the +non-homogeneous nature of web data, an angle ignored in existing literature. +Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various +quality subsets of web data; (ii) account for how utility diminishes for a data +point at its 'nth' repetition; and (iii) formulate the mutual interaction of +various data pools when combined, enabling the estimation of model performance +on a combination of multiple data pools without ever jointly training on them. +Our key message is that data curation $\textit{cannot}$ be agnostic of the +total compute that a model will be trained for. Our scaling laws allow us to +curate the best possible pool for achieving top performance on Datacomp at +various compute budgets, carving out a pareto-frontier for data curation. 
Code +is available at https://github.com/locuslab/scaling_laws_data_filtering.",cs.LG,['cs.LG'] +GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces,Yingwenqi Jiang · Jiadong Tu · Yuan Liu · Xifeng Gao · Xiaoxiao Long · Wenping Wang · Yuexin Ma, ,https://arxiv.org/abs/2311.17977,,2311.17977.pdf,GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces,"The advent of neural 3D Gaussians has recently brought about a revolution in +the field of neural rendering, facilitating the generation of high-quality +renderings at real-time speeds. However, the explicit and discrete +representation encounters challenges when applied to scenes featuring +reflective surfaces. In this paper, we present GaussianShader, a novel method +that applies a simplified shading function on 3D Gaussians to enhance the +neural rendering in scenes with reflective surfaces while preserving the +training and rendering efficiency. The main challenge in applying the shading +function lies in the accurate normal estimation on discrete 3D Gaussians. +Specifically, we proposed a novel normal estimation framework based on the +shortest axis directions of 3D Gaussians with a delicately designed loss to +make the consistency between the normals and the geometries of Gaussian +spheres. Experiments show that GaussianShader strikes a commendable balance +between efficiency and visual quality. Our method surpasses Gaussian Splatting +in PSNR on specular object datasets, exhibiting an improvement of 1.57dB. When +compared to prior works handling reflective surfaces, such as Ref-NeRF, our +optimization time is significantly accelerated (23h vs. 0.58h). Please click on +our project website to see more results.",cs.CV,['cs.CV'] +BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation,Qihang Zhang · Yinghao Xu · Yujun Shen · Bo Dai · Bolei Zhou · Ceyuan Yang, ,https://arxiv.org/abs/2312.02136,,2312.02136.pdf,BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation,"Generating large-scale 3D scenes cannot simply apply existing 3D object +synthesis technique since 3D scenes usually hold complex spatial configurations +and consist of a number of objects at varying scales. We thus propose a +practical and efficient 3D representation that incorporates an equivariant +radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, +objects of synthesized 3D scenes could be easily manipulated through steering +the corresponding BEV maps. Moreover, by adequately incorporating positional +encoding and low-pass filters into the generator, the representation becomes +equivariant to the given BEV map. Such equivariance allows us to produce +large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and +then stitching them with smooth consistency. Extensive experiments on 3D scene +datasets demonstrate the effectiveness of our approach. 
Our project website is +at https://zqh0253.github.io/BerfScene/.",cs.CV,['cs.CV'] +L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream,Jingtao Sun · Yaonan Wang · Mingtao Feng · Yulan Guo · Ajmal Mian · Mike Zheng Shou, ,https://arxiv.org/abs/2403.12728,,2403.12728.pdf,Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation,"Fully-supervised category-level pose estimation aims to determine the 6-DoF +poses of unseen instances from known categories, requiring expensive mannual +labeling costs. Recently, various self-supervised category-level pose +estimation methods have been proposed to reduce the requirement of the +annotated datasets. However, most methods rely on synthetic data or 3D CAD +model for self-supervised training, and they are typically limited to +addressing single-object pose problems without considering multi-objective +tasks or shape reconstruction. To overcome these challenges and limitations, we +introduce a diffusion-driven self-supervised network for multi-object shape +reconstruction and categorical pose estimation, only leveraging the shape +priors. Specifically, to capture the SE(3)-equivariant pose features and 3D +scale-invariant shape information, we present a Prior-Aware Pyramid 3D Point +Transformer in our network. This module adopts a point convolutional layer with +radial-kernels for pose-aware learning and a 3D scale-invariant graph +convolution layer for object-level shape representation, respectively. +Furthermore, we introduce a pretrain-to-refine self-supervised training +paradigm to train our network. It enables proposed network to capture the +associations between shape priors and observations, addressing the challenge of +intra-class shape variations by utilising the diffusion mechanism. Extensive +experiments conducted on four public datasets and a self-built dataset +demonstrate that our method significantly outperforms state-of-the-art +self-supervised category-level baselines and even surpasses some +fully-supervised instance-level and category-level methods.",cs.CV,['cs.CV'] +Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection,Jiawen Zhu · Choubo Ding · Yu Tian · Guansong Pang, ,https://arxiv.org/abs/2310.12790,,2310.12790.pdf,Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection,"Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly +detection area - aims at utilizing a few samples of anomaly classes seen during +training to detect unseen anomalies (i.e., samples from open-set anomaly +classes), while effectively identifying the seen anomalies. Benefiting from the +prior knowledge illustrated by the seen anomalies, current OSAD methods can +often largely reduce false positive errors. However, these methods are trained +in a closed-set setting and treat the anomaly examples as from a homogeneous +distribution, rendering them less effective in generalizing to unseen anomalies +that can be drawn from any distribution. This paper proposes to learn +heterogeneous anomaly distributions using the limited anomaly examples to +address this issue. To this end, we introduce a novel approach, namely Anomaly +Heterogeneity Learning (AHL), that simulates a diverse set of heterogeneous +anomaly distributions and then utilizes them to learn a unified heterogeneous +abnormality model in surrogate open-set environments. 
Further, AHL is a generic +framework that existing OSAD models can plug and play for enhancing their +abnormality modeling. Extensive experiments on nine real-world anomaly +detection datasets show that AHL can 1) substantially enhance different +state-of-the-art OSAD models in detecting seen and unseen anomalies, and 2) +effectively generalize to unseen anomalies in new domains. Code is available at +https://github.com/mala-lab/AHL.",cs.CV,['cs.CV'] +"1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness",Bernd Prach · Fabio Brau · Giorgio Buttazzo · Christoph Lampert, ,https://arxiv.org/abs/2311.16833,,2311.16833.pdf,"1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness","The robustness of neural networks against input perturbations with bounded +magnitude represents a serious concern in the deployment of deep learning +models in safety-critical systems. Recently, the scientific community has +focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz +neural networks that leverage Lipschitz bounded dense and convolutional layers. +Although different methods have been proposed in the literature to achieve this +goal, understanding the performance of such methods is not straightforward, +since different metrics can be relevant (e.g., training time, memory usage, +accuracy, certifiable robustness) for different applications. For this reason, +this work provides a thorough theoretical and empirical comparison between +methods by evaluating them in terms of memory usage, speed, and certifiable +robust accuracy. The paper also provides some guidelines and recommendations to +support the user in selecting the methods that work best depending on the +available resources. We provide code at +https://github.com/berndprach/1LipschitzLayersCompared.",cs.LG,"['cs.LG', 'cs.CV', 'cs.NE']" +Bootstrapping Autonomous Driving Radars with Self-Supervised Learning,Yiduo Hao · Sohrab Madani · Junfeng Guan · Mo Alloulah · Saurabh Gupta · Haitham Al Hassanieh, ,https://arxiv.org/abs/2312.04519,,2312.04519.pdf,Bootstrapping Autonomous Driving Radars with Self-Supervised Learning,"The perception of autonomous vehicles using radars has attracted increased +research interest due its ability to operate in fog and bad weather. However, +training radar models is hindered by the cost and difficulty of annotating +large-scale radar data. To overcome this bottleneck, we propose a +self-supervised learning framework to leverage the large amount of unlabeled +radar data to pre-train radar-only embeddings for self-driving perception +tasks. The proposed method combines radar-to-radar and radar-to-vision +contrastive losses to learn a general representation from unlabeled radar +heatmaps paired with their corresponding camera images. When used for +downstream object detection, we demonstrate that the proposed self-supervision +framework can improve the accuracy of state-of-the-art supervised baselines by +$5.8\%$ in mAP. 
Code is available at \url{https://github.com/yiduohao/Radical}.",cs.CV,['cs.CV'] +CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models,Tuna Han Salih Meral · Enis Simsar · Federico Tombari · Pinar Yanardag, ,https://arxiv.org/abs/2312.06059v1,,2312.06059v1.pdf,CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models,"Images produced by text-to-image diffusion models might not always faithfully +represent the semantic intent of the provided text prompt, where the model +might overlook or entirely fail to produce certain objects. Existing solutions +often require customly tailored functions for each of these problems, leading +to sub-optimal results, especially for complex prompts. Our work introduces a +novel perspective by tackling this challenge in a contrastive context. Our +approach intuitively promotes the segregation of objects in attention maps +while also maintaining that pairs of related attributes are kept close to each +other. We conduct extensive experiments across a wide variety of scenarios, +each involving unique combinations of objects, attributes, and scenes. These +experiments effectively showcase the versatility, efficiency, and flexibility +of our method in working with both latent and pixel-based diffusion models, +including Stable Diffusion and Imagen. Moreover, we publicly share our source +code to facilitate further research.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model,Dian Zheng · Xiao-Ming Wu · Shuzhou Yang · Jian Zhang · Jian-Fang Hu · Wei-Shi Zheng, ,https://arxiv.org/abs/2403.11157,,2403.11157.pdf,Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model,"Universal image restoration is a practical and potential computer vision task +for real-world applications. The main challenge of this task is handling the +different degradation distributions at once. Existing methods mainly utilize +task-specific conditions (e.g., prompt) to guide the model to learn different +distributions separately, named multi-partite mapping. However, it is not +suitable for universal model learning as it ignores the shared information +between different tasks. In this work, we propose an advanced selective +hourglass mapping strategy based on diffusion model, termed DiffUIR. Two novel +considerations make our DiffUIR non-trivial. Firstly, we equip the model with +strong condition guidance to obtain accurate generation direction of diffusion +model (selective). More importantly, DiffUIR integrates a flexible shared +distribution term (SDT) into the diffusion algorithm elegantly and naturally, +which gradually maps different distributions into a shared one. In the reverse +process, combined with SDT and strong condition guidance, DiffUIR iteratively +guides the shared distribution to the task-specific distribution with high +image quality (hourglass). Without bells and whistles, by only modifying the +mapping strategy, we achieve state-of-the-art performance on five image +restoration tasks, 22 benchmarks in the universal setting and zero-shot +generalization setting. Surprisingly, by only using a lightweight model (only +0.89M), we could achieve outstanding performance. 
The source code and +pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR",cs.CV,['cs.CV'] +ActiveDC: Distribution Calibration for Active Finetuning,Wenshuai Xu · Zhenghui Hu · Yu Lu · Jinzhou Meng · Qingjie Liu · Yunhong Wang, ,https://arxiv.org/abs/2311.07634,,2311.07634.pdf,ActiveDC: Distribution Calibration for Active Finetuning,"The pretraining-finetuning paradigm has gained popularity in various computer +vision tasks. In this paradigm, the emergence of active finetuning arises due +to the abundance of large-scale data and costly annotation requirements. Active +finetuning involves selecting a subset of data from an unlabeled pool for +annotation, facilitating subsequent finetuning. However, the use of a limited +number of training samples can lead to a biased distribution, potentially +resulting in model overfitting. In this paper, we propose a new method called +ActiveDC for the active finetuning tasks. Firstly, we select samples for +annotation by optimizing the distribution similarity between the subset to be +selected and the entire unlabeled pool in continuous space. Secondly, we +calibrate the distribution of the selected samples by exploiting implicit +category information in the unlabeled pool. The feature visualization provides +an intuitive sense of the effectiveness of our approach to distribution +calibration. We conducted extensive experiments on three image classification +datasets with different sampling ratios. The results indicate that ActiveDC +consistently outperforms the baseline performance in all image classification +tasks. The improvement is particularly significant when the sampling ratio is +low, with performance gains of up to 10%. Our code will be released.",cs.CV,['cs.CV'] +Extreme Point Supervised Instance Segmentation,Hyeonjun Lee · Sehyun Hwang · Suha Kwak, ,https://arxiv.org/abs/2405.20729,,2405.20729.pdf,Extreme Point Supervised Instance Segmentation,"This paper introduces a novel approach to learning instance segmentation +using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost +points, of each object. These points are readily available in the modern +bounding box annotation process while offering strong clues for precise +segmentation, and thus allows to improve performance at the same annotation +cost with box-supervised methods. Our work considers extreme points as a part +of the true instance mask and propagates them to identify potential foreground +and background points, which are all together used for training a pseudo label +generator. Then pseudo labels given by the generator are in turn used for +supervised learning of our final model. On three public benchmarks, our method +significantly outperforms existing box-supervised methods, further narrowing +the gap with its fully supervised counterpart. In particular, our model +generates high-quality masks when a target object is separated into multiple +parts, where previous box-supervised methods often fail.",cs.CV,['cs.CV'] +Towards Robust 3D Pose Transfer with Adversarial Learning,Haoyu Chen · Hao Tang · Ehsan Adeli · Guoying Zhao, ,https://arxiv.org/abs/2404.02242,,2404.02242.pdf,Towards Robust 3D Pose Transfer with Adversarial Learning,"3D pose transfer that aims to transfer the desired pose to a target mesh is +one of the most challenging 3D generation tasks. Previous attempts rely on +well-defined parametric human models or skeletal joints as driving pose +sources. 
However, to obtain those clean pose sources, cumbersome but necessary +pre-processing pipelines are inevitable, hindering implementations of the +real-time applications. This work is driven by the intuition that the +robustness of the model can be enhanced by introducing adversarial samples into +the training, leading to a more invulnerable model to the noisy inputs, which +even can be further extended to directly handling the real-world data like raw +point clouds/scans without intermediate processing. Furthermore, we propose a +novel 3D pose Masked Autoencoder (3D-PoseMAE), a customized MAE that +effectively learns 3D extrinsic presentations (i.e., pose). 3D-PoseMAE +facilitates learning from the aspect of extrinsic attributes by simultaneously +generating adversarial samples that perturb the model and learning the +arbitrary raw noisy poses via a multi-scale masking strategy. Both qualitative +and quantitative studies show that the transferred meshes given by our network +result in much better quality. Besides, we demonstrate the strong +generalizability of our method on various poses, different domains, and even +raw scans. Experimental results also show meaningful insights that the +intermediate adversarial samples generated in the training can successfully +attack the existing pose transfer models.",cs.CV,['cs.CV'] +Improving Image Restoration through Removing Degradations in Textual Representations,Jingbo Lin · Zhilu Zhang · Yuxiang Wei · Dongwei Ren · Dongsheng Jiang · Qi Tian · Wangmeng Zuo, ,https://arxiv.org/abs/2312.17334,,2312.17334.pdf,Improving Image Restoration through Removing Degradations in Textual Representations,"In this paper, we introduce a new perspective for improving image restoration +by removing degradation in the textual representations of a given degraded +image. Intuitively, restoration is much easier on text modality than image one. +For example, it can be easily conducted by removing degradation-related words +while keeping the content-aware words. Hence, we combine the advantages of +images in detail description and ones of text in degradation removal to perform +restoration. To address the cross-modal assistance, we propose to map the +degraded images into textual representations for removing the degradations, and +then convert the restored textual representations into a guidance image for +assisting image restoration. In particular, We ingeniously embed an +image-to-text mapper and text restoration module into CLIP-equipped +text-to-image models to generate the guidance. Then, we adopt a simple +coarse-to-fine approach to dynamically inject multi-scale information from +guidance to image restoration networks. Extensive experiments are conducted on +various image restoration tasks, including deblurring, dehazing, deraining, and +denoising, and all-in-one image restoration. The results showcase that our +method outperforms state-of-the-art ones across all these tasks. 
The codes and +models are available at \url{https://github.com/mrluin/TextualDegRemoval}.",cs.CV,['cs.CV'] +Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution,Longguang Wang · Juncheng Li · Yingqian Wang · Qingyong Hu · Yulan Guo, ,,https://link.springer.com/article/10.1007/s11760-023-02936-x,,,,,nan +Dispersed Structured Light for Hyperspectral 3D Imaging,Suhyun Shin · Seokjun Choi · Felix Heide · Seung-Hwan Baek, ,https://arxiv.org/abs/2311.18287,,2311.18287.pdf,Dispersed Structured Light for Hyperspectral 3D Imaging,"Hyperspectral 3D imaging aims to acquire both depth and spectral information +of a scene. However, existing methods are either prohibitively expensive and +bulky or compromise on spectral and depth accuracy. In this work, we present +Dispersed Structured Light (DSL), a cost-effective and compact method for +accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera +system by placing a sub-millimeter thick diffraction grating film front of the +projector. The grating disperses structured light based on light wavelength. To +utilize the dispersed structured light, we devise a model for dispersive +projection image formation and a per-pixel hyperspectral 3D reconstruction +method. We validate DSL by instantiating a compact experimental prototype. DSL +achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth +error of 1mm. We demonstrate that DSL outperforms prior work on practical +hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D +imaging for diverse application domains, including computer vision and +graphics, cultural heritage, geology, and biology.",eess.IV,"['eess.IV', 'cs.CV', 'cs.GR']" +MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric,Haokun Lin · Haoli Bai · Zhili Liu · Lu Hou · Muyi Sun · Linqi Song · Ying Wei · Zhenan Sun, ,,https://paperswithcode.com/paper/mope-clip-structured-pruning-for-efficient,,,,,nan +"TurboSL: Dense, Accurate and Fast 3D by Neural Inverse Structured Light",Parsa Mirdehghan · Maxx Wu · Wenzheng Chen · Wenzheng Chen · David B. Lindell · Kiriakos Kutulakos, ,https://arxiv.org/abs/2306.13361,,2306.13361.pdf,Neural 360$^\circ$ Structured Light with Learned Metasurfaces,"Structured light has proven instrumental in 3D imaging, LiDAR, and +holographic light projection. Metasurfaces, comprised of sub-wavelength-sized +nanostructures, facilitate 180$^\circ$ field-of-view (FoV) structured light, +circumventing the restricted FoV inherent in traditional optics like +diffractive optical elements. However, extant metasurface-facilitated +structured light exhibits sub-optimal performance in downstream tasks, due to +heuristic pattern designs such as periodic dots that do not consider the +objectives of the end application. In this paper, we present neural 360$^\circ$ +structured light, driven by learned metasurfaces. We propose a differentiable +framework, that encompasses a computationally-efficient 180$^\circ$ wave +propagation model and a task-specific reconstructor, and exploits both +transmission and reflection channels of the metasurface. Leveraging a +first-order optimizer within our differentiable framework, we optimize the +metasurface design, thereby realizing neural 360$^\circ$ structured light. We +have utilized neural 360$^\circ$ structured light for holographic light +projection and 3D imaging. 
Specifically, we demonstrate the first 360$^\circ$ +light projection of complex patterns, enabled by our propagation model that can +be computationally evaluated 50,000$\times$ faster than the Rayleigh-Sommerfeld +propagation. For 3D imaging, we improve depth-estimation accuracy by +5.09$\times$ in RMSE compared to the heuristically-designed structured light. +Neural 360$^\circ$ structured light promises robust 360$^\circ$ imaging and +display for robotics, extended-reality systems, and human-computer +interactions.",physics.optics,"['physics.optics', 'cs.CV', 'eess.IV']" +Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning,wenlong deng · Christos Thrampoulidis · Xiaoxiao Li, ,https://arxiv.org/abs/2310.18285,,2310.18285.pdf,Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning,"Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve +state-of-the-art performance with improved efficiency in various computer +vision tasks. This suggests a promising paradigm shift of adapting pre-trained +ViT models to Federated Learning (FL) settings. However, the challenge of data +heterogeneity among FL clients presents a significant hurdle in effectively +deploying ViT models. Existing Generalized FL (GFL) and Personalized FL (PFL) +methods have limitations in balancing performance across both global and local +data distributions. In this paper, we present a novel algorithm, SGPT, that +integrates GFL and PFL approaches by employing a unique combination of both +shared and group-specific prompts. This design enables SGPT to capture both +common and group-specific features. A key feature of SGPT is its prompt +selection module, which facilitates the training of a single global model +capable of automatically adapting to diverse local client data distributions +without the need for local fine-tuning. To effectively train the prompts, we +utilize block coordinate descent (BCD), learning from common feature +information (shared prompts), and then more specialized knowledge (group +prompts) iteratively. Theoretically, we justify that learning the proposed +prompts can reduce the gap between global and local performance. Empirically, +we conduct experiments on both label and feature heterogeneity settings in +comparison with state-of-the-art baselines, along with extensive ablation +studies, to substantiate the superior performance of SGPT.",cs.LG,"['cs.LG', 'cs.CV']" +Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models,Shitian Zhao · Zhuowan Li · YadongLu · Alan L. Yuille · Yan Wang, ,https://arxiv.org/abs/2312.06685,,2312.06685.pdf,Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models,"While Multi-modal Language Models (MLMs) demonstrate impressive multimodal +ability, they still struggle on providing factual and precise responses for +tasks like visual question answering (VQA). In this paper, we address this +challenge from the perspective of contextual information. We propose Causal +Context Generation, Causal-CoG, which is a prompting strategy that engages +contextual information to enhance precise VQA during inference. Specifically, +we prompt MLMs to generate contexts, i.e, text description of an image, and +engage the generated contexts for question answering. 
Moreover, we investigate +the advantage of contexts on VQA from a causality perspective, introducing +causality filtering to select samples for which contextual information is +helpful. To show the effectiveness of Causal-CoG, we run extensive experiments +on 10 multimodal benchmarks and show consistent improvements, e.g., +6.30% on +POPE, +13.69% on Vizwiz and +6.43% on VQAv2 compared to direct decoding, +surpassing existing methods. We hope Casual-CoG inspires explorations of +context knowledge in multimodal models, and serves as a plug-and-play strategy +for MLM decoding.",cs.AI,['cs.AI'] +Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi,Kangwei Yan · Fei Wang · Bo Qian · Han Ding · Jinsong Han · Xing Wei, ,https://arxiv.org/abs/2404.02041,,2404.02041.pdf,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +Multi-agent Collaborative Perception via Motion-aware Robust Communication Network,Shixin Hong · Yu LIU · Zhi Li · Shaohui Li · You He, ,https://arxiv.org/abs/2401.12694,,2401.12694.pdf,Pragmatic Communication in Multi-Agent Collaborative Perception,"Collaborative perception allows each agent to enhance its perceptual +abilities by exchanging messages with others. It inherently results in a +trade-off between perception ability and communication costs. Previous works +transmit complete full-frame high-dimensional feature maps among agents, +resulting in substantial communication costs. To promote communication +efficiency, we propose only transmitting the information needed for the +collaborator's downstream task. 
This pragmatic communication strategy focuses +on three key aspects: i) pragmatic message selection, which selects +task-critical parts from the complete data, resulting in spatially and +temporally sparse feature vectors; ii) pragmatic message representation, which +achieves pragmatic approximation of high-dimensional feature vectors with a +task-adaptive dictionary, enabling communicating with integer indices; iii) +pragmatic collaborator selection, which identifies beneficial collaborators, +pruning unnecessary communication links. Following this strategy, we first +formulate a mathematical optimization framework for the +perception-communication trade-off and then propose PragComm, a multi-agent +collaborative perception system with two key components: i) single-agent +detection and tracking and ii) pragmatic collaboration. The proposed PragComm +promotes pragmatic communication and adapts to a wide range of communication +conditions. We evaluate PragComm for both collaborative 3D object detection and +tracking tasks in both real-world, V2V4Real, and simulation datasets, OPV2V and +V2X-SIM2.0. PragComm consistently outperforms previous methods with more than +32.7K times lower communication volume on OPV2V. Code is available at +github.com/PhyllisH/PragComm.",cs.CV,['cs.CV'] +Dense Optical Tracking: Connecting the Dots,Guillaume Le Moing · Jean Ponce · Cordelia Schmid,https://github.com/16lemoing/dot,https://arxiv.org/abs/2312.00786,,2312.00786.pdf,Dense Optical Tracking: Connecting the Dots,"Recent approaches to point tracking are able to recover the trajectory of any +scene point through a large portion of a video despite the presence of +occlusions. They are, however, too slow in practice to track every point +observed in a single frame in a reasonable amount of time. This paper +introduces DOT, a novel, simple and efficient method for solving this problem. +It first extracts a small set of tracks from key regions at motion boundaries +using an off-the-shelf point tracking algorithm. Given source and target +frames, DOT then computes rough initial estimates of a dense flow field and +visibility mask through nearest-neighbor interpolation, before refining them +using a learnable optical flow estimator that explicitly handles occlusions and +can be trained on synthetic data with ground-truth correspondences. We show +that DOT is significantly more accurate than current optical flow techniques, +outperforms sophisticated ""universal"" trackers like OmniMotion, and is on par +with, or better than, the best point tracking algorithms like CoTracker while +being at least two orders of magnitude faster. Quantitative and qualitative +experiments with synthetic and real videos validate the promise of the proposed +approach. Code, data, and videos showcasing the capabilities of our approach +are available in the project webpage: https://16lemoing.github.io/dot .",cs.CV,['cs.CV'] +Enhancing Post-training Quantization Calibration through Contrastive Learning,Yuzhang Shang · Gaowen Liu · Ramana Kompella · Yan Yan, ,https://arxiv.org/abs/2311.06322,,2311.06322.pdf,Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models,"Diffusion models have achieved great success due to their remarkable +generation ability. However, their high computational overhead is still a +troublesome problem. Recent studies have leveraged post-training quantization +(PTQ) to compress diffusion models. 
However, most of them only focus on +unconditional models, leaving the quantization of widely used large pretrained +text-to-image models, e.g., Stable Diffusion, largely unexplored. In this +paper, we propose a novel post-training quantization method PCR (Progressive +Calibration and Relaxing) for text-to-image diffusion models, which consists of +a progressive calibration strategy that considers the accumulated quantization +error across timesteps, and an activation relaxing strategy that improves the +performance with negligible cost. Additionally, we demonstrate the previous +metrics for text-to-image diffusion model quantization are not accurate due to +the distribution gap. To tackle the problem, we propose a novel QDiffBench +benchmark, which utilizes data in the same domain for more accurate evaluation. +Besides, QDiffBench also considers the generalization performance of the +quantized model outside the calibration dataset. Extensive experiments on +Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our +method and benchmark. Moreover, we are the first to achieve quantization for +Stable Diffusion XL while maintaining the performance.",cs.CV,"['cs.CV', 'cs.LG']" +PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images,Diantao Tu · Hainan Cui · Xianwei Zheng · Shuhan Shen, ,https://arxiv.org/abs/2404.02041,,,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +Semantics-aware Motion Retargeting with Vision-Language Models,Haodong Zhang · ZhiKe Chen · Haocheng Xu · Lei Hao · Xiaofei Wu · Songcen Xu · Zhensong Zhang · Yue Wang · Rong Xiong, ,https://arxiv.org/abs/2312.01964,,2312.01964.pdf,Semantics-aware Motion Retargeting with Vision-Language Models,"Capturing and preserving motion semantics is essential to motion retargeting +between animation characters. However, most of the previous works neglect the +semantic information or rely on human-designed joint-level representations. 
+Here, we present a novel Semantics-aware Motion reTargeting (SMT) method with +the advantage of vision-language models to extract and maintain meaningful +motion semantics. We utilize a differentiable module to render 3D motions. Then +the high-level motion semantics are incorporated into the motion retargeting +process by feeding the vision-language model with the rendered images and +aligning the extracted semantic embeddings. To ensure the preservation of +fine-grained motion details and high-level semantics, we adopt a two-stage +pipeline consisting of skeleton-aware pre-training and fine-tuning with +semantics and geometry constraints. Experimental results show the effectiveness +of the proposed method in producing high-quality motion retargeting results +while accurately preserving motion semantics.",cs.CV,"['cs.CV', 'cs.GR']" +HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances,Supreeth Narasimhaswamy · Uttaran Bhattacharya · Xiang Chen · Ishita Dasgupta · Saayan Mitra · Minh Hoai, ,https://arxiv.org/abs/2403.01693,,2403.01693.pdf,HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances,"Text-to-image generative models can generate high-quality humans, but realism +is lost when generating hands. Common artifacts include irregular hand poses, +shapes, incorrect numbers of fingers, and physically implausible finger +orientations. To generate images with realistic hands, we propose a novel +diffusion-based architecture called HanDiffuser that achieves realism by +injecting hand embeddings in the generative process. HanDiffuser consists of +two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and +MANO-Hand parameters from input text prompts, and a Text-Guided +Hand-Params-to-Image diffusion model to synthesize images by conditioning on +the prompts and hand parameters generated by the previous component. We +incorporate multiple aspects of hand representation, including 3D shapes and +joint-level finger positions, orientations and articulations, for robust +learning and reliable performance during inference. We conduct extensive +quantitative and qualitative experiments and perform user studies to +demonstrate the efficacy of our method in generating images with high-quality +hands.",cs.CV,"['cs.CV', 'cs.AI']" +Communication-Efficient Collaborative Perception via Information Filling with Codebook,Yue Hu · Juntong Peng · Sifei Liu · Junhao Ge · Si Liu · Siheng Chen, ,https://arxiv.org/abs/2405.04966,,2405.04966.pdf,Communication-Efficient Collaborative Perception via Information Filling with Codebook,"Collaborative perception empowers each agent to improve its perceptual +ability through the exchange of perceptual messages with other agents. It +inherently results in a fundamental trade-off between perception ability and +communication cost. To address this bottleneck issue, our core idea is to +optimize the collaborative messages from two key aspects: representation and +selection. The proposed codebook-based message representation enables the +transmission of integer codes, rather than high-dimensional feature maps. The +proposed information-filling-driven message selection optimizes local messages +to collectively fill each agent's information demand, preventing information +overflow among multiple agents. 
By integrating these two designs, we propose +CodeFilling, a novel communication-efficient collaborative perception system, +which significantly advances the perception-communication trade-off and is +inclusive to both homogeneous and heterogeneous collaboration settings. We +evaluate CodeFilling in both a real-world dataset, DAIR-V2X, and a new +simulation dataset, OPV2VH+. Results show that CodeFilling outperforms previous +SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1,333/1,206 times lower communication +volume. Our code is available at https://github.com/PhyllisH/CodeFilling.",cs.IT,"['cs.IT', 'cs.CV', 'cs.MA', 'math.IT']" +Adversarial Score Distillation: When score distillation meets GAN,Min Wei · Jingkai Zhou · Junyao Sun · Xuesong Zhang, ,https://arxiv.org/abs/2312.00739,,2312.00739.pdf,Adversarial Score Distillation: When score distillation meets GAN,"Existing score distillation methods are sensitive to classifier-free guidance +(CFG) scale: manifested as over-smoothness or instability at small CFG scales, +while over-saturation at large ones. To explain and analyze these issues, we +revisit the derivation of Score Distillation Sampling (SDS) and decipher +existing score distillation with the Wasserstein Generative Adversarial Network +(WGAN) paradigm. With the WGAN paradigm, we find that existing score +distillation either employs a fixed sub-optimal discriminator or conducts +incomplete discriminator optimization, resulting in the scale-sensitive issue. +We propose the Adversarial Score Distillation (ASD), which maintains an +optimizable discriminator and updates it using the complete optimization +objective. Experiments show that the proposed ASD performs favorably in 2D +distillation and text-to-3D tasks against existing methods. Furthermore, to +explore the generalization ability of our WGAN paradigm, we extend ASD to the +image editing task, which achieves competitive results. The project page and +code are at https://github.com/2y7c3/ASD.",cs.CV,['cs.CV'] +Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks,Bin Xiao · Haiping Wu · Weijian Xu · Xiyang Dai · Houdong Hu · Yumao Lu · Michael Zeng · Ce Liu · Lu Yuan, ,https://arxiv.org/abs/2311.06242,,2311.06242.pdf,Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks,"We introduce Florence-2, a novel vision foundation model with a unified, +prompt-based representation for a variety of computer vision and +vision-language tasks. While existing large vision models excel in transfer +learning, they struggle to perform a diversity of tasks with simple +instructions, a capability that implies handling the complexity of various +spatial hierarchy and semantic granularity. Florence-2 was designed to take +text-prompt as task instructions and generate desirable results in text forms, +whether it be captioning, object detection, grounding or segmentation. This +multi-task learning setup demands large-scale, high-quality annotated data. To +this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive +visual annotations on 126 million images, using an iterative strategy of +automated image annotation and model refinement. We adopted a +sequence-to-sequence structure to train Florence-2 to perform versatile and +comprehensive vision tasks. 
Extensive evaluations on numerous tasks +demonstrated Florence-2 to be a strong vision foundation model contender with +unprecedented zero-shot and fine-tuning capabilities.",cs.CV,['cs.CV'] +GOAT-Bench: A Benchmark for Multi-modal Lifelong Navigation,Mukul Khanna · Ram Ramrakhya · Gunjan Chhablani · Sriram Yenamandra · Theo Gervet · Matthew Chang · Zsolt Kira · Devendra Singh Chaplot · Dhruv Batra · Roozbeh Mottaghi, ,https://arxiv.org/abs/2404.06609,,2404.06609.pdf,GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation,"The Embodied AI community has made significant strides in visual navigation +tasks, exploring targets from 3D coordinates, objects, language descriptions, +and images. However, these navigation models often handle only a single input +modality as the target. With the progress achieved so far, it is time to move +towards universal navigation models capable of handling various goal types, +enabling more effective user interaction with robots. To facilitate this goal, +we propose GOAT-Bench, a benchmark for the universal navigation task referred +to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to +a sequence of targets specified by the category name, language description, or +image in an open-vocabulary fashion. We benchmark monolithic RL and modular +methods on the GOAT task, analyzing their performance across modalities, the +role of explicit and implicit scene memories, their robustness to noise in goal +specifications, and the impact of memory in lifelong scenarios.",cs.AI,"['cs.AI', 'cs.RO']" +LLaFS: When Large Language Models Meet Few-Shot Segmentation,Lanyun Zhu · Tianrun Chen · Deyi Ji · Deyi Ji · Jieping Ye · Jun Liu, ,https://arxiv.org/abs/2311.16926,,2311.16926.pdf,LLaFS: When Large Language Models Meet Few-Shot Segmentation,"This paper proposes LLaFS, the first attempt to leverage large language +models (LLMs) in few-shot segmentation. In contrast to the conventional +few-shot segmentation methods that only rely on the limited and biased +information from the annotated support images, LLaFS leverages the vast prior +knowledge gained by LLM as an effective supplement and directly uses the LLM to +segment images in a few-shot manner. To enable the text-based LLM to handle +image-related tasks, we carefully design an input instruction that allows the +LLM to produce segmentation results represented as polygons, and propose a +region-attribute table to simulate the human visual mechanism and provide +multi-modal guidance. We also synthesize pseudo samples and use curriculum +learning for pretraining to augment data and achieve better optimization. LLaFS +achieves state-of-the-art results on multiple datasets, showing the potential +of using LLMs for few-shot computer vision tasks.",cs.CV,['cs.CV'] +MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction,Hiroaki Santo · Fumio Okura · Yasuyuki Matsushita,https://github.com/hiroaki-santo/mvcps-neus,https://arxiv.org/abs/2405.12057,,2405.12057.pdf,NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo,"In this work we present a novel multi-view photometric stereo (PS) method. +Like many works in 3D reconstruction we are leveraging neural shape +representations and learnt renderers. However, our work differs from the +state-of-the-art multi-view PS methods such as PS-NeRF or SuperNormal we +explicity leverage per-pixel intensity renderings rather than relying mainly on +estimated normals. 
+ We model point light attenuation and explicitly raytrace cast shadows in +order to best approximate each points incoming radiance. This is used as input +to a fully neural material renderer that uses minimal prior assumptions and it +is jointly optimised with the surface. Finally, estimated normal and +segmentation maps can also incorporated in order to maximise the surface +accuracy. + Our method is among the first to outperform the classical approach of +DiLiGenT-MV and achieves average 0.2mm Chamfer distance for objects imaged at +approx 1.5m distance away with approximate 400x400 resolution. Moreover, we +show robustness to poor normals in low light count scenario, achieving 0.27mm +Chamfer distance when pixel rendering is used instead of estimated normals.",cs.CV,['cs.CV'] +FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking,Seokju Cho · Gabriel Huang · Seungryong Kim · Joon-Young Lee, ,https://arxiv.org/abs/2312.00786,,,Dense Optical Tracking: Connecting the Dots,"Recent approaches to point tracking are able to recover the trajectory of any +scene point through a large portion of a video despite the presence of +occlusions. They are, however, too slow in practice to track every point +observed in a single frame in a reasonable amount of time. This paper +introduces DOT, a novel, simple and efficient method for solving this problem. +It first extracts a small set of tracks from key regions at motion boundaries +using an off-the-shelf point tracking algorithm. Given source and target +frames, DOT then computes rough initial estimates of a dense flow field and +visibility mask through nearest-neighbor interpolation, before refining them +using a learnable optical flow estimator that explicitly handles occlusions and +can be trained on synthetic data with ground-truth correspondences. We show +that DOT is significantly more accurate than current optical flow techniques, +outperforms sophisticated ""universal"" trackers like OmniMotion, and is on par +with, or better than, the best point tracking algorithms like CoTracker while +being at least two orders of magnitude faster. Quantitative and qualitative +experiments with synthetic and real videos validate the promise of the proposed +approach. Code, data, and videos showcasing the capabilities of our approach +are available in the project webpage: https://16lemoing.github.io/dot .",cs.CV,['cs.CV'] +MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,Matteo Farina · Massimiliano Mancini · Elia Cunegatti · Gaowen Liu · Giovanni Iacca · Elisa Ricci,https://github.com/FarinaMatteo/multiflow,https://arxiv.org/abs/2404.05621,,2404.05621.pdf,MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,"While excellent in transfer learning, Vision-Language models (VLMs) come with +high computational costs due to their large number of parameters. To address +this issue, removing parameters via model pruning is a viable solution. +However, existing techniques for VLMs are task-specific, and thus require +pruning the network from scratch for each new task of interest. In this work, +we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). +Given a pretrained VLM, the goal is to find a unique pruned counterpart +transferable to multiple unknown downstream tasks. In this challenging setting, +the transferable representations already encoded in the pretrained model are a +key aspect to preserve. 
Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a +first, gradient-free, pruning framework for TA-VLP where: (i) the importance of +a parameter is expressed in terms of its magnitude and its information flow, by +incorporating the saliency of the neurons it connects; and (ii) pruning is +driven by the emergent (multimodal) distribution of the VLM parameters after +pretraining. We benchmark eight state-of-the-art pruning algorithms in the +context of TA-VLP, experimenting with two VLMs, three vision-language tasks, +and three pruning ratios. Our experimental results show that MULTIFLOW +outperforms recent sophisticated, combinatorial competitors in the vast +majority of the cases, paving the way towards addressing TA-VLP. The code is +publicly available at https://github.com/FarinaMatteo/multiflow.",cs.CV,['cs.CV'] +In-Context Matting,He Guo · Zixuan Ye · Zhiguo Cao · Hao Lu, ,https://arxiv.org/abs/2403.15789,,2403.15789.pdf,In-Context Matting,"We introduce in-context matting, a novel task setting of image matting. Given +a reference image of a certain foreground and guided priors such as points, +scribbles, and masks, in-context matting enables automatic alpha estimation on +a batch of target images of the same foreground category, without additional +auxiliary input. This setting marries good performance in auxiliary input-based +matting and ease of use in automatic matting, which finds a good trade-off +between customization and automation. To overcome the key challenge of accurate +foreground matching, we introduce IconMatting, an in-context matting model +built upon a pre-trained text-to-image diffusion model. Conditioned on inter- +and intra-similarity matching, IconMatting can make full use of reference +context to generate accurate target alpha mattes. To benchmark the task, we +also introduce a novel testing dataset ICM-$57$, covering 57 groups of +real-world images. Quantitative and qualitative results on the ICM-57 testing +set show that IconMatting rivals the accuracy of trimap-based matting while +retaining the automation level akin to automatic matting. Code is available at +https://github.com/tiny-smart/in-context-matting",cs.CV,['cs.CV'] +Interactive Continual Learning: Fast and Slow Thinking,Biqing Qi · Xinquan Chen · Junqi Gao · Dong Li · Jianxing Liu · Ligang Wu · Bowen Zhou, ,https://arxiv.org/abs/2403.02628,,2403.02628.pdf,Interactive Continual Learning: Fast and Slow Thinking,"Advanced life forms, sustained by the synergistic interaction of neural +cognitive mechanisms, continually acquire and transfer knowledge throughout +their lifespan. In contrast, contemporary machine learning paradigms exhibit +limitations in emulating the facets of continual learning (CL). Nonetheless, +the emergence of large language models (LLMs) presents promising avenues for +realizing CL via interactions with these models. Drawing on Complementary +Learning System theory, this paper presents a novel Interactive Continual +Learning (ICL) framework, enabled by collaborative interactions among models of +various sizes. Specifically, we assign the ViT model as System1 and multimodal +LLM as System2. To enable the memory module to deduce tasks from class +information and enhance Set2Set retrieval, we propose the Class-Knowledge-Task +Multi-Head Attention (CKT-MHA). Additionally, to improve memory retrieval in +System1 through enhanced geometric representation, we introduce the CL-vMF +mechanism, based on the von Mises-Fisher (vMF) distribution. 
Meanwhile, we +introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) +strategy to identify hard examples, thus enhancing collaboration between +System1 and System2 for complex reasoning realization. Comprehensive evaluation +of our proposed ICL demonstrates significant resistance to forgetting and +superior performance relative to existing methods. Code is available at +github.com/ICL.",cs.CV,"['cs.CV', 'cs.LG']" +The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing,Denis Bobkov · Vadim Titov · Aibek Alanov · Dmitry Vetrov, ,https://ar5iv.labs.arxiv.org/html/2203.08450,,2203.08450.pdf,The Devil Is in the Details: Window-based Attention for Image Compression,"Learned image compression methods have exhibited superior rate-distortion +performance than classical image compression standards. Most existing learned +image compression models are based on Convolutional Neural Networks (CNNs). +Despite great contributions, a main drawback of CNN based model is that its +structure is not designed for capturing local redundancy, especially the +non-repetitive textures, which severely affects the reconstruction quality. +Therefore, how to make full use of both global structure and local texture +becomes the core problem for learning-based image compression. Inspired by +recent progresses of Vision Transformer (ViT) and Swin Transformer, we found +that combining the local-aware attention mechanism with the global-related +feature learning could meet the expectation in image compression. In this +paper, we first extensively study the effects of multiple kinds of attention +mechanisms for local features learning, then introduce a more straightforward +yet effective window-based local attention block. The proposed window-based +attention is very flexible which could work as a plug-and-play component to +enhance CNN and Transformer models. Moreover, we propose a novel Symmetrical +TransFormer (STF) framework with absolute transformer blocks in the +down-sampling encoder and up-sampling decoder. Extensive experimental +evaluations have shown that the proposed method is effective and outperforms +the state-of-the-art methods. The code is publicly available at +https://github.com/Googolxx/STF.",eess.IV,"['eess.IV', 'cs.CV']" +RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos,Hongchi Xia · Yang Fu · Sifei Liu · Xiaolong Wang, ,https://arxiv.org/abs/2401.12592,,2401.12592.pdf,RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos,"We introduce a new RGB-D object dataset captured in the wild called +WildRGB-D. Unlike most existing real-world object-centric datasets which only +come with RGB capturing, the direct capture of the depth channel allows better +3D annotations and broader downstream applications. WildRGB-D comprises +large-scale category-level RGB-D object videos, which are taken using an iPhone +to go around the objects in 360 degrees. It contains around 8500 recorded +objects and nearly 20000 RGB-D videos across 46 common object categories. These +videos are taken with diverse cluttered backgrounds with three setups to cover +as many real-world scenarios as possible: (i) a single object in one video; +(ii) multiple objects in one video; and (iii) an object with a static hand in +one video. The dataset is annotated with object masks, real-world scale camera +poses, and reconstructed aggregated point clouds from RGBD videos. 
We benchmark +four tasks with WildRGB-D including novel view synthesis, camera pose +estimation, object 6d pose estimation, and object surface reconstruction. Our +experiments show that the large-scale capture of RGB-D objects provides a large +potential to advance 3D object learning. Our project page is +https://wildrgbd.github.io/.",cs.CV,['cs.CV'] +Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification,Zhenyu Cui · Jiahuan Zhou · Xun Wang · Manyu Zhu · Yuxin Peng, ,https://arxiv.org/abs/2403.16003,,2403.16003.pdf,Diverse Representation Embedding for Lifelong Person Re-Identification,"Lifelong Person Re-Identification (LReID) aims to continuously learn from +successive data streams, matching individuals across multiple cameras. The key +challenge for LReID is how to effectively preserve old knowledge while +incrementally learning new information, which is caused by task-level domain +gaps and limited old task datasets. Existing methods based on CNN backbone are +insufficient to explore the representation of each instance from different +perspectives, limiting model performance on limited old task datasets and new +task datasets. Unlike these methods, we propose a Diverse Representations +Embedding (DRE) framework that first explores a pure transformer for LReID. The +proposed DRE preserves old knowledge while adapting to new information based on +instance-level and task-level layout. Concretely, an Adaptive Constraint Module +(ACM) is proposed to implement integration and push away operations between +multiple overlapping representations generated by transformer-based backbone, +obtaining rich and discriminative representations for each instance to improve +adaptive ability of LReID. Based on the processed diverse representations, we +propose Knowledge Update (KU) and Knowledge Preservation (KP) strategies at the +task-level layout by introducing the adjustment model and the learner model. KU +strategy enhances the adaptive learning ability of learner models for new +information under the adjustment model prior, and KP strategy preserves old +knowledge operated by representation-level alignment and logit-level +supervision in limited old task datasets while guaranteeing the adaptive +learning information capacity of the LReID model. Compared to state-of-the-art +methods, our method achieves significantly improved performance in holistic, +large-scale, and occluded datasets.",cs.CV,"['cs.CV', 'cs.AI']" +6-DoF Pose Estimation with MultiScale Residual Correlation,Yuelong Li · Yafei Mao · Raja Bala · Sunil Hadap,https://github.com/amzn/mrc-net-6d-pose,https://arxiv.org/abs/2403.08019,,2403.08019.pdf,MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation,"We propose a single-shot approach to determining 6-DoF pose of an object with +available 3D computer-aided design (CAD) model from a single RGB image. Our +method, dubbed MRC-Net, comprises two stages. The first performs pose +classification and renders the 3D object in the classified pose. The second +stage performs regression to predict fine-grained residual pose within class. +Connecting the two stages is a novel multi-scale residual correlation (MRC) +layer that captures high-and-low level correspondences between the input image +and rendering from first stage. MRC-Net employs a Siamese network with shared +weights between both stages to learn embeddings for input and rendered images. 
+To mitigate ambiguity when predicting discrete pose class labels on symmetric +objects, we use soft probabilistic labels to define pose class in the first +stage. We demonstrate state-of-the-art accuracy, outperforming all competing +RGB-based methods on four challenging BOP benchmark datasets: T-LESS, LM-O, +YCB-V, and ITODD. Our method is non-iterative and requires no complex +post-processing.",cs.CV,['cs.CV'] +Minimal Perspective Autocalibration,Andrea Porfiri Dal Cin · Timothy Duff · Luca Magri · Tomas Pajdla, ,https://arxiv.org/abs/2405.05605,,2405.05605.pdf,Minimal Perspective Autocalibration,"We introduce a new family of minimal problems for reconstruction from +multiple views. Our primary focus is a novel approach to autocalibration, a +long-standing problem in computer vision. Traditional approaches to this +problem, such as those based on Kruppa's equations or the modulus constraint, +rely explicitly on the knowledge of multiple fundamental matrices or a +projective reconstruction. In contrast, we consider a novel formulation +involving constraints on image points, the unknown depths of 3D points, and a +partially specified calibration matrix $K$. For $2$ and $3$ views, we present a +comprehensive taxonomy of minimal autocalibration problems obtained by relaxing +some of these constraints. These problems are organized into classes according +to the number of views and any assumed prior knowledge of $K$. Within each +class, we determine problems with the fewest -- or a relatively small number of +-- solutions. From this zoo of problems, we devise three practical solvers. +Experiments with synthetic and real data and interfacing our solvers with +COLMAP demonstrate that we achieve superior accuracy compared to +state-of-the-art calibration methods. The code is available at +https://github.com/andreadalcin/MinimalPerspectiveAutocalibration",cs.CV,['cs.CV'] +Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification,Jiancheng Zhang · Haijin Zeng · Yongyong Chen · Dengxiu Yu · Yinping Zhao,https://github.com/ZhangJC-2k/SSR,,https://ieeexplore.ieee.org/document/10411766,,,,,nan +WorDepth: Variational Language Prior for Monocular Depth Estimation,Ziyao Zeng · Hyoungseob Park · Fengyu Yang · Daniel Wang · Stefano Soatto · Dong Lao · Alex Wong, ,https://arxiv.org/abs/2404.03635,,2404.03635.pdf,WorDepth: Variational Language Prior for Monocular Depth Estimation,"Three-dimensional (3D) reconstruction from a single image is an ill-posed +problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text +description(s) is similarly ill-posed, i.e. spatial arrangements of objects +described. We investigate the question of whether two inherently ambiguous +modalities can be used in conjunction to produce metric-scaled reconstructions. +To test this, we focus on monocular depth estimation, the problem of predicting +a dense depth map from a single image, but with an additional text caption +describing the scene. To this end, we begin by encoding the text caption as a +mean and standard deviation; using a variational framework, we learn the +distribution of the plausible metric reconstructions of 3D scenes corresponding +to the text captions as a prior. To ""select"" a specific reconstruction or depth +map, we encode the given image through a conditional sampler that samples from +the latent space of the variational text encoder, which is then decoded to the +output depth map. 
Our approach is trained alternatingly between the text and +image branches: in one optimization step, we predict the mean and standard +deviation from the text description and sample from a standard Gaussian, and in +the other, we sample using a (image) conditional sampler. Once trained, we +directly predict depth from the encoded text using the conditional sampler. We +demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where +we show that language can consistently improve performance in both.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.MM']" +Hierarchical Patch Diffusion Models for High-Resolution Video Generation,Ivan Skorokhodov · Willi Menapace · Aliaksandr Siarohin · Sergey Tulyakov, ,http://export.arxiv.org/abs/2310.19512,,2310.19512.pdf,VideoCrafter1: Open Diffusion Models for High-Quality Video Generation,"Video generation has increasingly gained interest in both academia and +industry. Although commercial tools can generate plausible videos, there is a +limited number of open-source models available for researchers and engineers. +In this work, we introduce two diffusion models for high-quality video +generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V +models synthesize a video based on a given text input, while I2V models +incorporate an additional image input. Our proposed T2V model can generate +realistic and cinematic-quality videos with a resolution of $1024 \times 576$, +outperforming other open-source T2V models in terms of quality. The I2V model +is designed to produce videos that strictly adhere to the content of the +provided reference image, preserving its content, structure, and style. This +model is the first open-source I2V foundation model capable of transforming a +given image into a video clip while maintaining content preservation +constraints. We believe that these open-source video generation models will +contribute significantly to the technological advancements within the +community.",cs.CV,['cs.CV'] +End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames,Shuming Liu · Chenlin Zhang · Chen Zhao · Bernard Ghanem, ,https://arxiv.org/abs/2311.17241,,2311.17241.pdf,End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames,"Recently, temporal action detection (TAD) has seen significant performance +improvement with end-to-end training. However, due to the memory bottleneck, +only models with limited scales and limited data volumes can afford end-to-end +training, which inevitably restricts TAD performance. In this paper, we reduce +the memory consumption for end-to-end training, and manage to scale up the TAD +backbone to 1 billion parameters and the input video to 1,536 frames, leading +to significant detection performance. The key to our approach lies in our +proposed temporal-informative adapter (TIA), which is a novel lightweight +module that reduces training memory. Using TIA, we free the humongous backbone +from learning to adapt to the TAD task by only updating the parameters in TIA. +TIA also leads to better TAD representation by temporally aggregating context +from adjacent frames throughout the backbone. We evaluate our model across four +representative datasets. Owing to our efficient design, we are able to train +end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the +first end-to-end model to outperform the best feature-based methods. 
Code is +available at https://github.com/sming256/AdaTAD.",cs.CV,['cs.CV'] +Dual DETRs for Multi-Label Temporal Action Detection,Yuhan Zhu · Guozhen Zhang · Jing Tan · Gangshan Wu · Limin Wang, ,https://arxiv.org/abs/2404.00653,,2404.00653.pdf,Dual DETRs for Multi-Label Temporal Action Detection,"Temporal Action Detection (TAD) aims to identify the action boundaries and +the corresponding category within untrimmed videos. Inspired by the success of +DETR in object detection, several methods have adapted the query-based +framework to the TAD task. However, these approaches primarily followed DETR to +predict actions at the instance level (i.e., identify each action by its center +point), leading to sub-optimal boundary localization. To address this issue, we +propose a new Dual-level query-based TAD framework, namely DualDETR, to detect +actions from both instance-level and boundary-level. Decoding at different +levels requires semantics of different granularity, therefore we introduce a +two-branch decoding structure. This structure builds distinctive decoding +processes for different levels, facilitating explicit capture of temporal cues +and semantics at each level. On top of the two-branch design, we present a +joint query initialization strategy to align queries from both levels. +Specifically, we leverage encoder proposals to match queries from each level in +a one-to-one manner. Then, the matched queries are initialized using position +and content prior from the matched action proposal. The aligned dual-level +queries can refine the matched proposal with complementary cues during +subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD +benchmarks. The experimental results demonstrate the superior performance of +DualDETR to the existing state-of-the-art methods, achieving a substantial +improvement under det-mAP and delivering impressive results under seg-mAP.",cs.CV,['cs.CV'] +LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model,Chenjie Cao · Yunuo Cai · Qiaole Dong · Yikai Wang · Yanwei Fu,https://ewrfcas.github.io/LeftRefill/,https://arxiv.org/html/2405.18416v1,,2405.18416v1.pdf,3D StreetUnveiler with Semantic-Aware 2DGS,"Unveiling an empty street from crowded observations captured by in-car +cameras is crucial for autonomous driving. However, removing all temporary +static objects, such as stopped vehicles and standing pedestrians, presents a +significant challenge. Unlike object-centric 3D inpainting, which relies on +thorough observation in a small scene, street scenes involve long trajectories +that differ from previous 3D inpainting tasks. The camera-centric moving +environment of captured videos further complicates the task due to the limited +degree and time duration of object observation. To address these obstacles, we +introduce StreetUnveiler to reconstruct an empty street. StreetUnveiler learns +a 3D representation of the empty street from crowded observations. Our +representation is based on the hard-label semantic 2D Gaussian Splatting (2DGS) +for its scalability and ability to identify Gaussians to be removed. We inpaint +rendered image after removing unwanted Gaussians to provide pseudo-labels and +subsequently re-optimize the 2DGS. Given its temporal continuous movement, we +divide the empty street scene into observed, partial-observed, and unobserved +regions, which we propose to locate through a rendered alpha map. 
This +decomposition helps us to minimize the regions that need to be inpainted. To +enhance the temporal consistency of the inpainting, we introduce a novel +time-reversal framework to inpaint frames in reverse order and use later frames +as references for earlier frames to fully utilize the long-trajectory +observations. Our experiments conducted on the street scene dataset +successfully reconstructed a 3D representation of the empty street. The mesh +representation of the empty street can be extracted for further applications. +Project page and more visualizations can be found at: +https://streetunveiler.github.io",cs.CV,['cs.CV'] +3DiffTection: 3D Object Detection with Geometry-aware Diffusion Features,Chenfeng Xu · Huan Ling · Sanja Fidler · Or Litany, ,https://arxiv.org/abs/2311.04391,,2311.04391.pdf,3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features,"We present 3DiffTection, a state-of-the-art method for 3D object detection +from single images, leveraging features from a 3D-aware diffusion model. +Annotating large-scale image data for 3D detection is resource-intensive and +time-consuming. Recently, pretrained large image diffusion models have become +prominent as effective feature extractors for 2D perception tasks. However, +these features are initially trained on paired text and image data, which are +not optimized for 3D tasks, and often exhibit a domain gap when applied to the +target data. Our approach bridges these gaps through two specialized tuning +strategies: geometric and semantic. For geometric tuning, we fine-tune a +diffusion model to perform novel view synthesis conditioned on a single image, +by introducing a novel epipolar warp operator. This task meets two essential +criteria: the necessity for 3D awareness and reliance solely on posed image +data, which are readily available (e.g., from videos) and does not require +manual annotation. For semantic refinement, we further train the model on +target data with detection supervision. Both tuning phases employ ControlNet to +preserve the integrity of the original feature capabilities. In the final step, +we harness these enhanced capabilities to conduct a test-time prediction +ensemble across multiple virtual viewpoints. Through our methodology, we obtain +3D-aware features that are tailored for 3D detection and excel in identifying +cross-view point correspondences. Consequently, our model emerges as a powerful +3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a +precedent in single-view 3D detection by 9.43\% in AP3D on the +Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data +efficiency and generalization to cross-domain data.",cs.CV,['cs.CV'] +Unsupervised Feature Learning with Emergent Data-Driven Prototypicality,Yunhui Guo · Youren Zhang · Yubei Chen · Stella X. Yu, ,https://arxiv.org/abs/2307.01421,,2307.01421.pdf,Unsupervised Feature Learning with Emergent Data-Driven Prototypicality,"Given an image set without any labels, our goal is to train a model that maps +each image to a point in a feature space such that, not only proximity +indicates visual similarity, but where it is located directly encodes how +prototypical the image is according to the dataset. 
+ Our key insight is to perform unsupervised feature learning in hyperbolic +instead of Euclidean space, where the distance between points still reflect +image similarity, and yet we gain additional capacity for representing +prototypicality with the location of the point: The closer it is to the origin, +the more prototypical it is. The latter property is simply emergent from +optimizing the usual metric learning objective: The image similar to many +training instances is best placed at the center of corresponding points in +Euclidean space, but closer to the origin in hyperbolic space. + We propose an unsupervised feature learning algorithm in Hyperbolic space +with sphere pACKing. HACK first generates uniformly packed particles in the +Poincar\'e ball of hyperbolic space and then assigns each image uniquely to +each particle. Images after congealing are regarded more typical of the dataset +it belongs to. With our feature mapper simply trained to spread out training +instances in hyperbolic space, we observe that images move closer to the origin +with congealing, validating our idea of unsupervised prototypicality discovery. +We demonstrate that our data-driven prototypicality provides an easy and +superior unsupervised instance selection to reduce sample complexity, increase +model generalization with atypical instances and robustness with typical ones.",cs.CV,"['cs.CV', 'cs.AI']" +Visual In-Context Prompting,Feng Li · Qing Jiang · Hao Zhang · Shilong Liu · Huaizhe Xu · Xueyan Zou · Tianhe Ren · Hongyang Li · Lei Zhang · Chunyuan Li · Jianwei Yang · Jianfeng Gao, ,https://arxiv.org/abs/2311.13601,,2311.13601.pdf,Visual In-Context Prompting,"In-context prompting in large language models (LLMs) has become a prevalent +approach to improve zero-shot capabilities, but this idea is less explored in +the vision domain. Existing visual prompting methods focus on referring +segmentation to segment the most relevant object, falling short of addressing +many generic vision tasks like open-set segmentation and detection. In this +paper, we introduce a universal visual in-context prompting framework for both +tasks. In particular, we build on top of an encoder-decoder architecture, and +develop a versatile prompt encoder to support a variety of prompts like +strokes, boxes, and points. We further enhance it to take an arbitrary number +of reference image segments as the context. Our extensive explorations show +that the proposed visual in-context prompting elicits extraordinary referring +and generic segmentation capabilities to refer and detect, yielding competitive +performance to close-set in-domain datasets and showing promising results on +many open-set segmentation datasets. By joint training on COCO and SA-1B, our +model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be +available at https://github.com/UX-Decoder/DINOv.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity,Yuhang Chen · Wenke Huang · Mang Ye,https://github.com/yuhangchen0/FedHEAL,https://arxiv.org/abs/2405.16585,,2405.16585.pdf,Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity,"Federated learning (FL) has emerged as a new paradigm for privacy-preserving +collaborative training. Under domain skew, the current FL approaches are biased +and face two fairness problems. 1) Parameter Update Conflict: data disparity +among clients leads to varying parameter importance and inconsistent update +directions. 
These two disparities cause important parameters to potentially be +overwhelmed by unimportant ones of dominant updates. It consequently results in +significant performance decreases for lower-performing clients. 2) Model +Aggregation Bias: existing FL approaches introduce unfair weight allocation and +neglect domain diversity. It leads to biased model convergence objective and +distinct performance among domains. We discover a pronounced directional update +consistency in Federated Learning and propose a novel framework to tackle above +issues. First, leveraging the discovered characteristic, we selectively discard +unimportant parameter updates to prevent updates from clients with lower +performance overwhelmed by unimportant parameters, resulting in fairer +generalization performance. Second, we propose a fair aggregation objective to +prevent global model bias towards some domains, ensuring that the global model +continuously aligns with an unbiased model. The proposed method is generic and +can be combined with other existing FL methods to enhance fairness. +Comprehensive experiments on Digits and Office-Caltech demonstrate the high +fairness and performance of our method.",cs.LG,"['cs.LG', 'cs.AI']" +Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector,Yifu Ding · Weilun Feng · Chuyan Chen · Jinyang Guo · Xianglong Liu, ,,,,,,,nan +MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion,Roy Kapon · Guy Tevet · Daniel Cohen-Or · Amit H. Bermano, ,https://arxiv.org/abs/2310.14729,,2310.14729.pdf,MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion,"We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion +generation, using 2D diffusion models that were trained on motions obtained +from in-the-wild videos. As such, MAS opens opportunities to exciting and +diverse fields of motion previously under-explored as 3D data is scarce and +hard to collect. MAS works by simultaneously denoising multiple 2D motion +sequences representing different views of the same 3D motion. It ensures +consistency across all views at each diffusion step by combining the individual +generations into a unified 3D sequence, and projecting it back to the original +views. We demonstrate MAS on 2D pose data acquired from videos depicting +professional basketball maneuvers, rhythmic gymnastic performances featuring a +ball apparatus, and horse races. In each of these domains, 3D motion capture is +arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the +Score Distillation approach, which optimizes each sample by repeatedly applying +small fixes, our method uses a sampling process that was constructed for the +diffusion framework. As we demonstrate, MAS avoids common issues such as +out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/",cs.CV,"['cs.CV', 'cs.GR']" +PEEKABOO: Interactive Video Generation via Masked-Diffusion,Yash Jain · Anshul Nasery · Vibhav Vineet · Harkirat Behl, ,https://arxiv.org/abs/2312.07509,,2312.07509.pdf,PEEKABOO: Interactive Video Generation via Masked-Diffusion,"Modern video generation models like Sora have achieved remarkable success in +producing high-quality videos. However, a significant limitation is their +inability to offer interactive control to users, a feature that promises to +open up unprecedented applications and creativity. 
In this work, we introduce +the first solution to equip diffusion-based video generation models with +spatio-temporal control. We present Peekaboo, a novel masked attention module, +which seamlessly integrates with current video generation models offering +control without the need for additional training or inference overhead. To +facilitate future research, we also introduce a comprehensive benchmark for +interactive video generation. This benchmark offers a standardized framework +for the community to assess the efficacy of emerging interactive video +generation models. Our extensive qualitative and quantitative assessments +reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline +models, all while maintaining the same latency. Code and benchmark are +available on the webpage.",cs.CV,"['cs.CV', 'cs.LG']" +Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization,Khiem Le · Tuan Long Ho · Cuong Do · Danh Le-Phuoc · KOK SENG WONG, ,https://arxiv.org/abs/2403.15605,,2403.15605.pdf,Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization,"Domain shift is a formidable issue in Machine Learning that causes a model to +suffer from performance degradation when tested on unseen domains. Federated +Domain Generalization (FedDG) attempts to train a global model using +collaborative clients in a privacy-preserving manner that can generalize well +to unseen clients possibly with domain shift. However, most existing FedDG +methods either cause additional privacy risks of data leakage or induce +significant costs in client communication and computation, which are major +concerns in the Federated Learning paradigm. To circumvent these challenges, +here we introduce a novel architectural method for FedDG, namely gPerXAN, which +relies on a normalization scheme working with a guiding regularizer. In +particular, we carefully design Personalized eXplicitly Assembled Normalization +to enforce client models selectively filtering domain-specific features that +are biased towards local data while retaining discrimination of those features. +Then, we incorporate a simple yet effective regularizer to guide these models +in directly capturing domain-invariant representations that the global model's +classifier can leverage. Extensive experimental results on two benchmark +datasets, i.e., PACS and Office-Home, and a real-world medical dataset, +Camelyon17, indicate that our proposed method outperforms other existing +methods in addressing this particular problem.",cs.CV,"['cs.CV', 'cs.LG']" +S$^2$MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering,Zhen Long · Qiyuan Wang · Yazhou Ren · Yipeng Liu · Ce Zhu, ,https://arxiv.org/abs/2403.09107,,2403.09107.pdf,S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering,"Anchor-based large-scale multi-view clustering has attracted considerable +attention for its effectiveness in handling massive datasets. However, current +methods mainly seek the consensus embedding feature for clustering by exploring +global correlations between anchor graphs or projection matrices.In this paper, +we propose a simple yet efficient scalable multi-view tensor clustering +(S^2MVTC) approach, where our focus is on learning correlations of embedding +features within and across views. Specifically, we first construct the +embedding feature tensor by stacking the embedding features of different views +into a tensor and rotating it. 
Additionally, we build a novel tensor +low-frequency approximation (TLFA) operator, which incorporates graph +similarity into embedding feature learning, efficiently achieving smooth +representation of embedding features within different views. Furthermore, +consensus constraints are applied to embedding features to ensure inter-view +semantic consistency. Experimental results on six large-scale multi-view +datasets demonstrate that S^2MVTC significantly outperforms state-of-the-art +algorithms in terms of clustering performance and CPU execution time, +especially when handling massive data. The code of S^2MVTC is publicly +available at https://github.com/longzhen520/S2MVTC.",cs.LG,"['cs.LG', 'cs.CV']" +LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content,Qihao Zhao · Yalun Dai · Hao Li · Wei Hu · Fan Zhang · Jun Liu, ,https://arxiv.org/abs/2403.05854,,2403.05854.pdf,LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content,"Long-tail recognition is challenging because it requires the model to learn +good representations from tail categories and address imbalances across all +categories. In this paper, we propose a novel generative and fine-tuning +framework, LTGC, to handle long-tail recognition via leveraging generated +content. Firstly, inspired by the rich implicit knowledge in large-scale models +(e.g., large language models, LLMs), LTGC leverages the power of these models +to parse and reason over the original tail data to produce diverse tail-class +content. We then propose several novel designs for LTGC to ensure the quality +of the generated data and to efficiently fine-tune the model using both the +generated and original data. The visualization demonstrates the effectiveness +of the generation module in LTGC, which produces accurate and diverse tail +data. Additionally, the experimental results demonstrate that our LTGC +outperforms existing state-of-the-art methods on popular long-tailed +benchmarks.",cs.CV,['cs.CV'] +BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation,Yunhao Ge · Yihe Tang · Jiashu Xu · Cem Gokmen · Chengshu Li · Wensi Ai · Benjamin Martinez · Arman Aydin · Mona Anvari · Ayush Chakravarthy · Hong-Xing Yu · Josiah Wong · Sanjana Srivastava · Sharon Lee · Shengxin Zha · Laurent Itti · Yunzhu Li · Roberto Martín-Martín · Miao Liu · Pengchuan Zhang · Ruohan Zhang · Li Fei-Fei · Jiajun Wu, ,,,,,,,nan +Relightful Harmonization: Lighting-aware Portrait Background Replacement,Mengwei Ren · Wei Xiong · Jae Shin Yoon · Zhixin Shu · Jianming Zhang · HyunJoon Jung · Guido Gerig · He Zhang, ,https://arxiv.org/abs/2312.06886,,2312.06886.pdf,Relightful Harmonization: Lighting-aware Portrait Background Replacement,"Portrait harmonization aims to composite a subject into a new background, +adjusting its lighting and color to ensure harmony with the background scene. +Existing harmonization techniques often only focus on adjusting the global +color and brightness of the foreground and ignore crucial illumination cues +from the background such as apparent lighting direction, leading to unrealistic +compositions. We introduce Relightful Harmonization, a lighting-aware diffusion +model designed to seamlessly harmonize sophisticated lighting effect for the +foreground portrait using any background image. Our approach unfolds in three +stages. First, we introduce a lighting representation module that allows our +diffusion model to encode lighting information from target image background. 
+Second, we introduce an alignment network that aligns lighting features learned +from image background with lighting features learned from panorama environment +maps, which is a complete representation for scene illumination. Last, to +further boost the photorealism of the proposed method, we introduce a novel +data simulation pipeline that generates synthetic training pairs from a diverse +range of natural images, which are used to refine the model. Our method +outperforms existing benchmarks in visual fidelity and lighting coherence, +showing superior generalization in real-world testing scenarios, highlighting +its versatility and practicality.",cs.CV,['cs.CV'] +Image Processing GNN: Breaking Rigidity in Super-Resolution,Yuchuan Tian · Hanting Chen · Chao Xu · Yunhe Wang, ,https://arxiv.org/abs/2310.10413,,2310.10413.pdf,Image super-resolution via dynamic network,"Convolutional neural networks (CNNs) depend on deep network architectures to +extract accurate information for image super-resolution. However, obtained +information of these CNNs cannot completely express predicted high-quality +images for complex scenes. In this paper, we present a dynamic network for +image super-resolution (DSRNet), which contains a residual enhancement block, +wide enhancement block, feature refinement block and construction block. The +residual enhancement block is composed of a residual enhanced architecture to +facilitate hierarchical features for image super-resolution. To enhance +robustness of obtained super-resolution model for complex scenes, a wide +enhancement block achieves a dynamic architecture to learn more robust +information to enhance applicability of an obtained super-resolution model for +varying scenes. To prevent interference of components in a wide enhancement +block, a refinement block utilizes a stacked architecture to accurately learn +obtained features. Also, a residual learning operation is embedded in the +refinement block to prevent long-term dependency problem. Finally, a +construction block is responsible for reconstructing high-quality images. +Designed heterogeneous architecture can not only facilitate richer structural +information, but also be lightweight, which is suitable for mobile digital +devices. Experimental results shows that our method is more competitive in +terms of performance and recovering time of image super-resolution and +complexity. The code of DSRNet can be obtained at +https://github.com/hellloxiaotian/DSRNet.",eess.IV,"['eess.IV', 'cs.CV']" +TexTile: A Differentiable Metric for Texture Tileability,Carlos Rodriguez-Pardo · Dan Casas · Elena Garces · Jorge Lopez-Moreno,https://mslab.es/projects/TexTile/,,,,,,,nan +GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning,Ye Yuan · Xueting Li · Yangyi Huang · Shalini De Mello · Koki Nagano · Jan Kautz · Umar Iqbal,https://nvlabs.github.io/GAvatar/,https://arxiv.org/abs/2312.11461,,2312.11461.pdf,GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning,"Gaussian splatting has emerged as a powerful 3D representation that harnesses +the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. +In this paper, we seek to leverage Gaussian splatting to generate realistic +animatable avatars from textual descriptions, addressing the limitations (e.g., +flexibility and efficiency) imposed by mesh or NeRF-based representations. 
+However, a naive application of Gaussian splatting cannot generate high-quality +animatable avatars and suffers from learning instability; it also cannot +capture fine avatar geometries and often leads to degenerate body parts. To +tackle these problems, we first propose a primitive-based 3D Gaussian +representation where Gaussians are defined inside pose-driven primitives to +facilitate animation. Second, to stabilize and amortize the learning of +millions of Gaussians, we propose to use neural implicit fields to predict the +Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries +and extract detailed meshes, we propose a novel SDF-based implicit mesh +learning approach for 3D Gaussians that regularizes the underlying geometries +and extracts highly detailed textured meshes. Our proposed method, GAvatar, +enables the large-scale generation of diverse animatable avatars using only +text prompts. GAvatar significantly surpasses existing methods in terms of both +appearance and geometry quality, and achieves extremely fast rendering (100 +fps) at 1K resolution.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Instance Tracking in 3D Scenes from Egocentric Videos,Yunhan Zhao · Haoyu Ma · Shu Kong · Charless Fowlkes,https://github.com/IT3DEgo/IT3DEgo/,https://arxiv.org/abs/2312.04117,,2312.04117.pdf,Instance Tracking in 3D Scenes from Egocentric Videos,"Egocentric sensors such as AR/VR devices capture human-object interactions +and offer the potential to provide task-assistance by recalling 3D locations of +objects of interest in the surrounding environment. This capability requires +instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We +explore this problem by first introducing a new benchmark dataset, consisting +of RGB and depth videos, per-frame camera pose, and instance-level annotations +in both 2D camera and 3D world coordinates. We present an evaluation protocol +which evaluates tracking performance in 3D coordinates with two settings for +enrolling instances to track: (1) single-view online enrollment where an +instance is specified on-the-fly based on the human wearer's interactions. and +(2) multi-view pre-enrollment where images of an instance to be tracked are +stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods +from relevant areas, e.g., single object tracking (SOT) -- running SOT methods +to track instances in 2D frames and lifting them to 3D using camera pose and +depth. We also present a simple method that leverages pretrained segmentation +and detection models to generate proposals from RGB frames and match proposals +with enrolled instance images. Perhaps surprisingly, our extensive experiments +show that our method (with no finetuning) significantly outperforms SOT-based +approaches. We conclude by arguing that the problem of egocentric instance +tracking is made easier by leveraging camera pose and using a 3D allocentric +(world) coordinate representation.",cs.CV,['cs.CV'] +ViT-Lens: Towards Omni-modal Representations,Stan Weixian Lei · Yixiao Ge · Kun Yi · Jianfeng Zhang · Difei Gao · Dylan Sun · Yuying Ge · Ying Shan · Mike Zheng Shou, ,https://arxiv.org/abs/2311.16081,,2311.16081.pdf,ViT-Lens: Towards Omni-modal Representations,"Aiming to advance AI agents, large foundation models significantly improve +reasoning and instruction execution, yet the current focus on vision and +language neglects the potential of perceiving diverse modalities in open-world +environments. 
However, the success of data-driven vision and language models is +costly or even infeasible to be reproduced for rare modalities. In this paper, +we present ViT-Lens-2 that facilitates efficient omni-modal representation +learning by perceiving novel modalities with a pretrained ViT and aligning them +to a pre-defined space. Specifically, the modality-specific lens is tuned to +project any-modal signals to an intermediate embedding space, which are then +processed by a strong ViT with pre-trained visual knowledge. The encoded +representations are optimized toward aligning with the modal-independent space, +pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified +solution for representation learning of increasing modalities with two +appealing advantages: (i) Unlocking the great potential of pretrained ViTs to +novel modalities effectively with efficient data regime; (ii) Enabling emergent +downstream capabilities through modality alignment and shared ViT parameters. +We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, +tactile and EEG, and set new state-of-the-art results across various +understanding tasks, such as zero-shot classification. By seamlessly +integrating ViT-Lens-2 into Multimodal Foundation Models, we enable +Any-modality to Text and Image Generation in a zero-shot manner. Code and +models are available at https://github.com/TencentARC/ViT-Lens.",cs.CV,"['cs.CV', 'cs.AI']" +VideoDistill: Language-aware Vision Distillation for Video Question Answering,Bo Zou · Chao Yang · Yu Qiao · Chengbin Quan · Youjian Zhao, ,https://arxiv.org/abs/2404.00973,,2404.00973.pdf,VideoDistill: Language-aware Vision Distillation for Video Question Answering,"Significant advancements in video question answering (VideoQA) have been made +thanks to thriving large image-language pretraining frameworks. Although these +image-language models can efficiently represent both video and language +branches, they typically employ a goal-free vision perception process and do +not interact vision with language well during the answer generation, thus +omitting crucial visual cues. In this paper, we are inspired by the human +recognition and learning pattern and propose VideoDistill, a framework with +language-aware (i.e., goal-driven) behavior in both vision perception and +answer generation process. VideoDistill generates answers only from +question-related visual embeddings and follows a thinking-observing-answering +approach that closely resembles human behavior, distinguishing it from previous +research. Specifically, we develop a language-aware gating mechanism to replace +the standard cross-attention, avoiding language's direct fusion into visual +representations. We incorporate this mechanism into two key components of the +entire framework. The first component is a differentiable sparse sampling +module, which selects frames containing the necessary dynamics and semantics +relevant to the questions. The second component is a vision refinement module +that merges existing spatial-temporal attention layers to ensure the extraction +of multi-grained visual semantics associated with the questions. We conduct +experimental evaluations on various challenging video question-answering +benchmarks, and VideoDistill achieves state-of-the-art performance in both +general and long-form VideoQA datasets. 
In Addition, we verify that +VideoDistill can effectively alleviate the utilization of language shortcut +solutions in the EgoTaskQA dataset.",cs.CV,['cs.CV'] +Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model,Zelin Peng · Zhengqin Xu · Zhilin Zeng · Lingxi Xie · Qi Tian · Wei Shen, ,https://arxiv.org/abs/2311.17112,,2311.17112.pdf,Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model,"Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash +the potential of large foundation models in novel scenarios with limited +training data. In the computer vision community, PEFT has shown effectiveness +in image classification, but little research has studied its ability for image +segmentation. Fine-tuning segmentation models usually require a heavier +adjustment of parameters to align the proper projection directions in the +parameter space for new scenarios. This raises a challenge to existing PEFT +algorithms, as they often inject a limited number of individual parameters into +each block, which prevents substantial adjustment of the projection direction +of the parameter space due to the limitation of Hidden Markov Chain along +blocks. In this paper, we equip PEFT with a cross-block orchestration mechanism +to enable the adaptation of the Segment Anything Model (SAM) to various +downstream scenarios. We introduce a novel inter-block communication module, +which integrates a learnable relation matrix to facilitate communication among +different coefficient sets of each PEFT block's parameter space. Moreover, we +propose an intra-block enhancement module, which introduces a linear projection +head whose weights are generated from a hyper-complex layer, further enhancing +the impact of the adjustment of projection directions on the entire parameter +space. Extensive experiments on diverse benchmarks demonstrate that our +proposed approach consistently improves the segmentation performance +significantly on novel scenarios with only around 1K additional parameters.",cs.CV,['cs.CV'] +Generating Illustrated Instructions,Sachit Menon · Ishan Misra · Rohit Girdhar, ,https://arxiv.org/abs/2312.04552,,2312.04552.pdf,Generating Illustrated Instructions,"We introduce the new task of generating Illustrated Instructions, i.e., +visual instructions customized to a user's needs. We identify desiderata unique +to this task, and formalize it through a suite of automatic and human +evaluation metrics, designed to measure the validity, consistency, and efficacy +of the generations. We combine the power of large language models (LLMs) +together with strong text-to-image generation diffusion models to propose a +simple approach called StackedDiffusion, which generates such illustrated +instructions given text as input. The resulting model strongly outperforms +baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, +users even prefer it to human-generated articles. Most notably, it enables +various new and exciting applications far beyond what static articles on the +web can provide, such as personalized instructions complete with intermediate +steps and pictures in response to a user's individual situation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +3D-LFM: Lifting Foundation Model,Mosam Dabhi · László A. 
Jeni · Simon Lucey, ,https://arxiv.org/abs/2312.11894,,2312.11894.pdf,3D-LFM: Lifting Foundation Model,"The lifting of 3D structure and camera from 2D landmarks is at the +cornerstone of the entire discipline of computer vision. Traditional methods +have been confined to specific rigid objects, such as those in +Perspective-n-Point (PnP) problems, but deep learning has expanded our +capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) +with resilience to noise, occlusions, and perspective distortions. All these +techniques, however, have been limited by the fundamental need to establish +correspondences across the 3D training data -- significantly limiting their +utility to applications where one has an abundance of ""in-correspondence"" 3D +data. Our approach harnesses the inherent permutation equivariance of +transformers to manage varying number of points per 3D data instance, +withstands occlusions, and generalizes to unseen categories. We demonstrate +state of the art performance across 2D-3D lifting task benchmarks. Since our +approach can be trained across such a broad class of structures we refer to it +simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction,Bo Zou · Chao Yang · Yu Qiao · Chengbin Quan · Youjian Zhao, ,https://arxiv.org/abs/2404.00913v1,,2404.00913v1.pdf,LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction,"Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, +which introduce extra modules or additional input sequences to inject new +skills or knowledge, may compromise the innate abilities of LLMs. In this +paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' +potential to better follow instructions by gradually paying more attention to +worthwhile information. Specifically, the LLaMA-Excitor does not directly +change the intermediate hidden state during the self-attention calculation of +the transformer structure. We designed the Excitor block as a bypass module for +the similarity score computation in LLMs' self-attention to reconstruct keys +and change the importance of values by learnable prompts. LLaMA-Excitor ensures +a self-adaptive allocation of additional attention to input instructions, thus +effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on +low-quality instruction-following datasets. Furthermore, we unify the modeling +of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a +powerful visual instruction follower without the need for complex multi-modal +alignment. Our proposed approach is evaluated in language-only and multi-modal +tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that +maintains basic capabilities while achieving a significant improvement (+6%) on +the MMLU benchmark. 
In the visual instruction tuning, we achieve a new +state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and a +comparable performance (88.39%) on ScienceQA to cutting-edge models with more +parameters and extensive vision-language pertaining.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation,Dale Decatur · Itai Lang · Kfir Aberman · Rana Hanocka, ,https://arxiv.org/abs/2311.09571,,2311.09571.pdf,3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation,"In this work we develop 3D Paintbrush, a technique for automatically +texturing local semantic regions on meshes via text descriptions. Our method is +designed to operate directly on meshes, producing texture maps which seamlessly +integrate into standard graphics pipelines. We opt to simultaneously produce a +localization map (to specify the edit region) and a texture map which conforms +to it. This synergistic approach improves the quality of both the localization +and the stylization. To enhance the details and resolution of the textured +area, we leverage multiple stages of a cascaded diffusion model to supervise +our local editing technique with generative priors learned from images at +different resolutions. Our technique, referred to as Cascaded Score +Distillation (CSD), simultaneously distills scores at multiple resolutions in a +cascaded fashion, enabling control over both the granularity and global +understanding of the supervision. We demonstrate the effectiveness of 3D +Paintbrush to locally texture a variety of shapes within different semantic +regions. Project page: https://threedle.github.io/3d-paintbrush",cs.GR,"['cs.GR', 'cs.CV']" +Epistemic Uncertainty Quantification For Pre-trained Neural Networks,Hanjing Wang · Qiang Ji, ,https://arxiv.org/abs/2404.10124,,2404.10124.pdf,Epistemic Uncertainty Quantification For Pre-trained Neural Network,"Epistemic uncertainty quantification (UQ) identifies where models lack +knowledge. Traditional UQ methods, often based on Bayesian neural networks, are +not suitable for pre-trained non-Bayesian models. Our study addresses +quantifying epistemic uncertainty for any pre-trained model, which does not +need the original training data or model modifications and can ensure broad +applicability regardless of network architectures or training techniques. +Specifically, we propose a gradient-based approach to assess epistemic +uncertainty, analyzing the gradients of outputs relative to model parameters, +and thereby indicating necessary model adjustments to accurately represent the +inputs. We first explore theoretical guarantees of gradient-based methods for +epistemic UQ, questioning the view that this uncertainty is only calculable +through differences between multiple models. We further improve gradient-driven +UQ by using class-specific weights for integrating gradients and emphasizing +distinct contributions from neural network layers. Additionally, we enhance UQ +accuracy by combining gradient and perturbation methods to refine the +gradients. 
We evaluate our approach on out-of-distribution detection, +uncertainty calibration, and active learning, demonstrating its superiority +over current state-of-the-art UQ methods for pre-trained models.",cs.LG,"['cs.LG', 'cs.CV']" +Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Anthropic Prior Knowledge,Bo Zou · Shaofeng Wang · Hao Liu · Gaoyue Sun · Yajie Wang · Zuo FeiFei · Chengbin Quan · Youjian Zhao, ,,https://paperswithcode.com/paper/teeth-seg-an-efficient-instance-segmentation,,,,,nan +Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer,Junyi Wu · Bin Duan · Weitai Kang · Hao Tang · Yan Yan, ,https://arxiv.org/abs/2403.14552,,2403.14552.pdf,Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer,"While Transformers have rapidly gained popularity in various computer vision +applications, post-hoc explanations of their internal mechanisms remain largely +unexplored. Vision Transformers extract visual information by representing +image regions as transformed tokens and integrating them via attention weights. +However, existing post-hoc explanation methods merely consider these attention +weights, neglecting crucial information from the transformed tokens, which +fails to accurately illustrate the rationales behind the models' predictions. +To incorporate the influence of token transformation into interpretation, we +propose TokenTM, a novel post-hoc explanation method that utilizes our +introduced measurement of token transformation effects. Specifically, we +quantify token transformation effects by measuring changes in token lengths and +correlations in their directions pre- and post-transformation. Moreover, we +develop initialization and aggregation rules to integrate both attention +weights and token transformation effects across all layers, capturing holistic +token contributions throughout the model. Experimental results on segmentation +and perturbation tests demonstrate the superiority of our proposed TokenTM +compared to state-of-the-art Vision Transformer explanation methods.",cs.CV,['cs.CV'] +Global Latent Neural Rendering,Thomas Tanay · Matteo Maggioni, ,https://arxiv.org/abs/2312.08338,,2312.08338.pdf,Global Latent Neural Rendering,"A recent trend among generalizable novel view synthesis methods is to learn a +rendering operator acting over single camera rays. This approach is promising +because it removes the need for explicit volumetric rendering, but it +effectively treats target images as collections of independent pixels. Here, we +propose to learn a global rendering operator acting over all camera rays +jointly. We show that the right representation to enable such rendering is a +5-dimensional plane sweep volume consisting of the projection of the input +images on a set of planes facing the target camera. Based on this +understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR), +an efficient convolutional architecture that performs the rendering operation +globally in a low-resolution latent space. 
Experiments on various datasets +under sparse and generalizable setups show that our approach consistently +outperforms existing methods by significant margins.",cs.CV,['cs.CV'] +MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,Yawar Siddiqui · Antonio Alliegro · Alexey Artemov · Tatiana Tommasi · Daniele Sirigatti · Vladislav Rosov · Angela Dai · Matthias Nießner, ,https://arxiv.org/abs/2311.15475,,2311.15475.pdf,MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,"We introduce MeshGPT, a new approach for generating triangle meshes that +reflects the compactness typical of artist-created meshes, in contrast to dense +triangle meshes extracted by iso-surfacing methods from neural fields. Inspired +by recent advances in powerful large language models, we adopt a sequence-based +approach to autoregressively generate triangle meshes as sequences of +triangles. We first learn a vocabulary of latent quantized embeddings, using +graph convolutions, which inform these embeddings of the local mesh geometry +and topology. These embeddings are sequenced and decoded into triangles by a +decoder, ensuring that they can effectively reconstruct the mesh. A transformer +is then trained on this learned vocabulary to predict the index of the next +embedding given previous embeddings. Once trained, our model can be +autoregressively sampled to generate new triangle meshes, directly generating +compact meshes with sharp edges, more closely imitating the efficient +triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable +improvement over state of the art mesh generation methods, with a 9% increase +in shape coverage and a 30-point enhancement in FID scores across various +categories.",cs.CV,"['cs.CV', 'cs.LG']" +Video Recognition in Portrait Mode,Mingfei Han · Linjie Yang · Xiaojie Jin · Jiashi Feng · Xiaojun Chang · Heng Wang, ,https://arxiv.org/abs/2312.13746v1,,2312.13746v1.pdf,Video Recognition in Portrait Mode,"The creation of new datasets often presents new challenges for video +recognition and can inspire novel ideas while addressing these challenges. +While existing datasets mainly comprise landscape mode videos, our paper seeks +to introduce portrait mode videos to the research community and highlight the +unique challenges associated with this video format. With the growing +popularity of smartphones and social media applications, recognizing portrait +mode videos is becoming increasingly important. To this end, we have developed +the first dataset dedicated to portrait mode video recognition, namely +PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a +data-driven manner, comprising 400 fine-grained categories, and rigorous +quality assurance was implemented to ensure the accuracy of human annotations. +In addition to the new dataset, we conducted a comprehensive analysis of the +impact of video format (portrait mode versus landscape mode) on recognition +accuracy and spatial bias due to the different formats. Furthermore, we +designed extensive experiments to explore key aspects of portrait mode video +recognition, including the choice of data augmentation, evaluation procedure, +the importance of temporal information, and the role of audio modality. 
+Building on the insights from our experimental results and the introduction of +PortraitMode-400, our paper aims to inspire further research efforts in this +emerging research area.",cs.CV,['cs.CV'] +VGGSfM: Visual Geometry Grounded Deep Structure From Motion,Jianyuan Wang · Nikita Karaev · Christian Rupprecht · David Novotny, ,https://arxiv.org/abs/2312.04563,,2312.04563.pdf,Visual Geometry Grounded Deep Structure From Motion,"Structure-from-motion (SfM) is a long-standing problem in the computer vision +community, which aims to reconstruct the camera poses and 3D structure of a +scene from a set of unconstrained 2D images. Classical frameworks solve this +problem in an incremental manner by detecting and matching keypoints, +registering images, triangulating 3D points, and conducting bundle adjustment. +Recent research efforts have predominantly revolved around harnessing the power +of deep learning techniques to enhance specific elements (e.g., keypoint +matching), but are still based on the original, non-differentiable pipeline. +Instead, we propose a new deep pipeline VGGSfM, where each component is fully +differentiable and thus can be trained in an end-to-end manner. To this end, we +introduce new mechanisms and simplifications. First, we build on recent +advances in deep 2D point tracking to extract reliable pixel-accurate tracks, +which eliminates the need for chaining pairwise matches. Furthermore, we +recover all cameras simultaneously based on the image and track features +instead of gradually registering cameras. Finally, we optimise the cameras and +triangulate 3D points via a differentiable bundle adjustment layer. We attain +state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, +and ETH3D.",cs.CV,"['cs.CV', 'cs.RO']" +Intrinsic Image Diffusion for Indoor Single-view Material Estimation,Peter Kocsis · Vincent Sitzmann · Matthias Nießner,https://peter-kocsis.github.io/IntrinsicImageDiffusion/,https://arxiv.org/abs/2312.12274,,2312.12274.pdf,Intrinsic Image Diffusion for Indoor Single-view Material Estimation,"We present Intrinsic Image Diffusion, a generative model for appearance +decomposition of indoor scenes. Given a single input view, we sample multiple +possible material explanations represented as albedo, roughness, and metallic +maps. Appearance decomposition poses a considerable challenge in computer +vision due to the inherent ambiguity between lighting and material properties +and the lack of real datasets. To address this issue, we advocate for a +probabilistic formulation, where instead of attempting to directly predict the +true material properties, we employ a conditional generative model to sample +from the solution space. Furthermore, we show that utilizing the strong learned +prior of recent diffusion models trained on large-scale real-world images can +be adapted to material estimation and highly improves the generalization to +real images. Our method produces significantly sharper, more consistent, and +more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on +PSNR and by $45\%$ better FID score on albedo prediction. 
We demonstrate the +effectiveness of our approach through experiments on both synthetic and +real-world datasets.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'I.4.8; I.2.10']" +An N-Point Linear Solver for Line and Motion Estimation with Event Cameras,Ling Gao · Daniel Gehrig · Hang Su · Davide Scaramuzza · Laurent Kneip,https://mgaoling.github.io/eventail/,https://arxiv.org/abs/2404.00842v1,,2404.00842v1.pdf,An N-Point Linear Solver for Line and Motion Estimation with Event Cameras,"Event cameras respond primarily to edges--formed by strong gradients--and are +thus particularly well-suited for line-based motion estimation. Recent work has +shown that events generated by a single line each satisfy a polynomial +constraint which describes a manifold in the space-time volume. Multiple such +constraints can be solved simultaneously to recover the partial linear velocity +and line parameters. In this work, we show that, with a suitable line +parametrization, this system of constraints is actually linear in the unknowns, +which allows us to design a novel linear solver. Unlike existing solvers, our +linear solver (i) is fast and numerically stable since it does not rely on +expensive root finding, (ii) can solve both minimal and overdetermined systems +with more than 5 events, and (iii) admits the characterization of all +degenerate cases and multiple solutions. The found line parameters are +singularity-free and have a fixed scale, which eliminates the need for +auxiliary constraints typically encountered in previous work. To recover the +full linear camera velocity we fuse observations from multiple lines with a +novel velocity averaging scheme that relies on a geometrically-motivated +residual, and thus solves the problem more efficiently than previous schemes +which minimize an algebraic residual. Extensive experiments in synthetic and +real-world settings demonstrate that our method surpasses the previous work in +numerical stability, and operates over 600 times faster.",cs.CV,['cs.CV'] +Benchmarking Segmentation Models with Mask-Preserved Attribute Editing,Zijin Yin · Kongming Liang · Bing Li · Zhanyu Ma · Jun Guo, ,https://arxiv.org/abs/2403.01231,,2403.01231.pdf,Benchmarking Segmentation Models with Mask-Preserved Attribute Editing,"When deploying segmentation models in practice, it is critical to evaluate +their behaviors in varied and complex scenes. Different from the previous +evaluation paradigms only in consideration of global attribute variations (e.g. +adverse weather), we investigate both local and global attribute variations for +robustness evaluation. To achieve this, we construct a mask-preserved attribute +editing pipeline to edit visual attributes of real images with precise control +of structural information. Therefore, the original segmentation labels can be +reused for the edited images. Using our pipeline, we construct a benchmark +covering both object and image attributes (e.g. color, material, pattern, +style). We evaluate a broad variety of semantic segmentation models, spanning +from conventional close-set models to recent open-vocabulary large models on +their robustness to different types of variations. We find that both local and +global attribute variations affect segmentation performances, and the +sensitivity of models diverges across different variation types. We argue that +local attributes have the same importance as global attributes, and should be +considered in the robustness evaluation of segmentation models. 
Code: +https://github.com/PRIS-CV/Pascal-EA.",cs.CV,['cs.CV'] +How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?,Yuxin Chen · Zongyang Ma · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Bing Li · Junfu Pu · Ying Shan · Xiaojuan Qi · Weiming Hu, ,https://arxiv.org/abs/2310.19654,,2310.19654.pdf,MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval,"Due to the success of large-scale visual-language pretraining (VLP) models +and the widespread use of image-text retrieval in industry areas, it is now +critically necessary to reduce the model size and streamline their +mobile-device deployment. Single- and dual-stream model structures are commonly +used in image-text retrieval with the goal of closing the semantic gap between +textual and visual modalities. While single-stream models use deep feature +fusion to achieve more accurate cross-model alignment, dual-stream models are +better at offline indexing and fast inference. We propose a Multi-teacher +Cross-modality Alignment Distillation (MCAD) technique to integrate the +advantages of single- and dual-stream models. By incorporating the fused +single-stream features into the image and text features of the dual-stream +model, we formulate new modified teacher similarity distributions and features. +Then, we conduct both distribution and feature distillation to boost the +capability of the student dual-stream model, achieving high retrieval +performance without increasing inference complexity. Extensive experiments +demonstrate the remarkable performance and high efficiency of MCAD on +image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model +on Snapdragon/Dimensity chips with only $\sim$100M running memory and +$\sim$8.0ms search latency, achieving the mobile-device application of VLP +models.",cs.CV,"['cs.CV', 'cs.AI']" +A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals,Jiangnan Tang · Jingya Wang · Kaiyang Ji · Lan Xu · Jingyi Yu · Ye Shi, ,https://arxiv.org/abs/2404.04890,,2404.04890.pdf,A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals,"Estimating full-body human motion via sparse tracking signals from +head-mounted displays and hand controllers in 3D scenes is crucial to +applications in AR/VR. One of the biggest challenges to this task is the +one-to-many mapping from sparse observations to dense full-body motions, which +endowed inherent ambiguities. To help resolve this ambiguous problem, we +introduce a new framework to combine rich contextual information provided by +scenes to benefit full-body motion tracking from sparse observations. To +estimate plausible human motions given sparse tracking signals and 3D scenes, +we develop $\text{S}^2$Fusion, a unified framework fusing \underline{S}cene and +sparse \underline{S}ignals with a conditional dif\underline{Fusion} model. +$\text{S}^2$Fusion first extracts the spatial-temporal relations residing in +the sparse signals via a periodic autoencoder, and then produces time-alignment +feature embedding as additional inputs. Subsequently, by drawing initial noisy +motion from a pre-trained prior, $\text{S}^2$Fusion utilizes conditional +diffusion to fuse scene geometry and sparse tracking signals to generate +full-body scene-aware motions.
The sampling procedure of $\text{S}^2$Fusion is +further guided by a specially designed scene-penetration loss and +phase-matching loss, which effectively regularizes the motion of the lower body +even in the absence of any tracking signals, making the generated motion much +more plausible and coherent. Extensive experimental results have demonstrated +that our $\text{S}^2$Fusion outperforms the state-of-the-art in terms of +estimation quality and smoothness.",cs.CV,['cs.CV'] +Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction,Ziyi Yang · Xinyu Gao · Wen Zhou · Shaohui Jiao · Yuqing Zhang · Xiaogang Jin, ,https://arxiv.org/abs/2309.13101,,2309.13101.pdf,Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction,"Implicit neural representation has paved the way for new approaches to +dynamic scene reconstruction and rendering. Nonetheless, cutting-edge dynamic +neural rendering methods rely heavily on these implicit representations, which +frequently struggle to capture the intricate details of objects in the scene. +Furthermore, implicit methods have difficulty achieving real-time rendering in +general dynamic scenes, limiting their use in a variety of tasks. To address +the issues, we propose a deformable 3D Gaussians Splatting method that +reconstructs scenes using 3D Gaussians and learns them in canonical space with +a deformation field to model monocular dynamic scenes. We also introduce an +annealing smoothing training mechanism with no extra overhead, which can +mitigate the impact of inaccurate poses on the smoothness of time interpolation +tasks in real-world datasets. Through a differential Gaussian rasterizer, the +deformable 3D Gaussians not only achieve higher rendering quality but also +real-time rendering speed. Experiments show that our method outperforms +existing methods significantly in terms of both rendering quality and speed, +making it well-suited for tasks such as novel-view synthesis, time +interpolation, and real-time rendering.",cs.CV,['cs.CV'] +Towards Generalizable Tumor Synthesis,Qi Chen · Xiaoxi Chen · Haorui Song · Alan L. Yuille · Zhiwei Xiong · Chen Wei · Zongwei Zhou, ,https://arxiv.org/abs/2402.19470,,2402.19470.pdf,Towards Generalizable Tumor Synthesis,"Tumor synthesis enables the creation of artificial tumors in medical images, +facilitating the training of AI models for tumor detection and segmentation. +However, success in tumor synthesis hinges on creating visually realistic +tumors that are generalizable across multiple organs and, furthermore, the +resulting AI models being capable of detecting real tumors in images sourced +from different domains (e.g., hospitals). This paper made a progressive stride +toward generalizable tumor synthesis by leveraging a critical observation: +early-stage tumors (< 2cm) tend to have similar imaging characteristics in +computed tomography (CT), whether they originate in the liver, pancreas, or +kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, +can create realistic tumors generalized to a range of organs even when trained +on a limited number of tumor examples from only one organ. 
Moreover, we have +shown that AI models trained on these synthetic tumors can be generalized to +detect and segment real tumors from CT volumes, encompassing a broad spectrum +of patient demographics, imaging protocols, and healthcare facilities.",eess.IV,"['eess.IV', 'cs.CV']" +Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection,Xiaohong Zhang · Huisheng Ye · Jingwen Li · Qinyu Tang · Yuanqi Li · Yanwen Guo · Jie Guo,https://huishengye.github.io/prompt3d/,https://arxiv.org/abs/2312.07530,,2312.07530.pdf,Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance,"Weakly supervised 3D object detection aims to learn a 3D detector with lower +annotation cost, e.g., 2D labels. Unlike prior work which still relies on few +accurate 3D annotations, we propose a framework to study how to leverage +constraints between 2D and 3D domains without requiring any 3D labels. +Specifically, we employ visual data from three perspectives to establish +connections between 2D and 3D domains. First, we design a feature-level +constraint to align LiDAR and image features based on object-aware regions. +Second, the output-level constraint is developed to enforce the overlap between +2D and projected 3D box estimations. Finally, the training-level constraint is +utilized by producing accurate and consistent 3D pseudo-labels that align with +the visual data. We conduct extensive experiments on the KITTI dataset to +validate the effectiveness of the proposed three constraints. Without using any +3D labels, our method achieves favorable performance against state-of-the-art +approaches and is competitive with the method that uses 500-frame 3D +annotations. Code and models will be made publicly available at +https://github.com/kuanchihhuang/VG-W3D.",cs.CV,['cs.CV'] +Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation,Haifeng Xia · Siyu Xia · Zhengming Ding, ,https://arxiv.org/abs/2405.02954,,2405.02954.pdf,Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-Training,"Source-free domain adaptation (SFDA) aims to adapt a source model trained on +a fully-labeled source domain to a related but unlabeled target domain. While +the source model is a key avenue for acquiring target pseudolabels, the +generated pseudolabels may exhibit source bias. In the conventional SFDA +pipeline, a large data (e.g. ImageNet) pre-trained feature extractor is used to +initialize the source model at the start of source training, and subsequently +discarded. Despite having diverse features important for generalization, the +pre-trained feature extractor can overfit to the source data distribution +during source training and forget relevant target domain knowledge. Rather than +discarding this valuable knowledge, we introduce an integrated framework to +incorporate pre-trained networks into the target adaptation process. The +proposed framework is flexible and allows us to plug modern pre-trained +networks into the adaptation process to leverage their stronger representation +learning capabilities. For adaptation, we propose the Co-learn algorithm to +improve target pseudolabel quality collaboratively through the source model and +a pre-trained feature extractor. Building on the recent success of the +vision-language model CLIP in zero-shot image recognition, we present an +extension Co-learn++ to further incorporate CLIP's zero-shot classification +decisions. 
We evaluate on 3 benchmark datasets and include more challenging +scenarios such as open-set, partial-set and open-partial SFDA. Experimental +results demonstrate that our proposed strategy improves adaptation performance +and can be successfully integrated with existing SFDA methods.",cs.CV,"['cs.CV', 'cs.LG']" +Reconstruction-free Cascaded Adaptive Compressive Sensing,Chenxi Qiu · Tao Yue · Xuemei Hu, ,https://arxiv.org/abs/2403.17006,,2403.17006.pdf,Invertible Diffusion Models for Compressed Sensing,"While deep neural networks (NN) significantly advance image compressed +sensing (CS) by improving reconstruction quality, the necessity of training +current CS NNs from scratch constrains their effectiveness and hampers rapid +deployment. Although recent methods utilize pre-trained diffusion models for +image reconstruction, they struggle with slow inference and restricted +adaptability to CS. To tackle these challenges, this paper proposes Invertible +Diffusion Models (IDM), a novel efficient, end-to-end diffusion-based CS +method. IDM repurposes a large-scale diffusion sampling process as a +reconstruction model, and finetunes it end-to-end to recover original images +directly from CS measurements, moving beyond the traditional paradigm of +one-step noise estimation learning. To enable such memory-intensive end-to-end +finetuning, we propose a novel two-level invertible design to transform both +(1) the multi-step sampling process and (2) the noise estimation U-Net in each +step into invertible networks. As a result, most intermediate features are +cleared during training to reduce up to 93.8% GPU memory. In addition, we +develop a set of lightweight modules to inject measurements into noise +estimator to further facilitate reconstruction. Experiments demonstrate that +IDM outperforms existing state-of-the-art CS networks by up to 2.64dB in PSNR. +Compared to the recent diffusion model-based approach DDNM, our IDM achieves up +to 10.09dB PSNR gain and 14.54 times faster inference.",cs.CV,['cs.CV'] +CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention,Mohammad Sadil Khan · Elona Dupont · Sk Aziz Ali · Kseniya Cherenkova · Anis Kacem · Djamila Aouada,https://cvi2.uni.lu/cadsig-net/,https://arxiv.org/abs/2402.17678,,2402.17678.pdf,CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention,"Reverse engineering in the realm of Computer-Aided Design (CAD) has been a +longstanding aspiration, though not yet entirely realized. Its primary aim is +to uncover the CAD process behind a physical object given its 3D scan. We +propose CAD-SIGNet, an end-to-end trainable and auto-regressive architecture to +recover the design history of a CAD model represented as a sequence of +sketch-and-extrusion from an input point cloud. Our model learns +visual-language representations by layer-wise cross-attention between point +cloud and CAD language embedding. In particular, a new Sketch instance Guided +Attention (SGA) module is proposed in order to reconstruct the fine-grained +details of the sketches. Thanks to its auto-regressive nature, CAD-SIGNet not +only reconstructs a unique full design history of the corresponding CAD model +given an input point cloud but also provides multiple plausible design choices. +This allows for an interactive reverse engineering scenario by providing +designers with multiple next-step choices along with the design process. 
+Extensive experiments on publicly available CAD datasets showcase the +effectiveness of our approach against existing baseline models in two settings, +namely, full design history recovery and conditional auto-completion from point +clouds.",cs.CV,['cs.CV'] +SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation,Junyan Ye · Qiyan Luo · Jinhua Yu · Huaping Zhong · Zhimeng Zheng · Conghui He · Weijia Li, ,https://arxiv.org/abs/2404.02638,,2404.02638.pdf,SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation,"This paper aims at achieving fine-grained building attribute segmentation in +a cross-view scenario, i.e., using satellite and street-view image pairs. The +main challenge lies in overcoming the significant perspective differences +between street views and satellite views. In this work, we introduce SG-BEV, a +novel approach for satellite-guided BEV fusion for cross-view semantic +segmentation. To overcome the limitations of existing cross-view projection +methods in capturing the complete building facade features, we innovatively +incorporate Bird's Eye View (BEV) method to establish a spatially explicit +mapping of street-view features. Moreover, we fully leverage the advantages of +multiple perspectives by introducing a novel satellite-guided reprojection +module, optimizing the uneven feature distribution issues associated with +traditional BEV methods. Our method demonstrates significant improvements on +four cross-view datasets collected from multiple cities, including New York, +San Francisco, and Boston. On average across these datasets, our method +achieves an increase in mIOU by 10.13% and 5.21% compared with the +state-of-the-art satellite-based and cross-view methods. The code and datasets +of this work will be released at https://github.com/yejy53/SG-BEV.",cs.CV,['cs.CV'] +MSU-4S - The Michigan State University Four Seasons Dataset,Daniel Kent · Mohammed Alyaqoub · Xiaohu Lu · Sayed Khatounabadi · Kookjin Sung · Cole Scheller · Alexander Dalat · Xinwei Guo · Asma Bin Thabit · Roberto Muntaner Whitley · Hayder Radha, ,,https://msuspartans.com/news/2024/5/1/womens-basketball-fralick-adds-four-to-womens-basketball-roster.aspx?print=true,,,,,nan +Retraining-free Model Quantization via One-Shot Weight-Coupling Learning,Chen Tang · Yuan Meng · Jiacheng Jiang · Shuzhao Xie · Rongwei Lu · Xinzhu Ma · Zhi Wang · Wenwu Zhu, ,https://arxiv.org/abs/2401.01543,,2401.01543.pdf,Retraining-free Model Quantization via One-Shot Weight-Coupling Learning,"Quantization is of significance for compressing the over-parameterized deep +neural models and deploying them on resource-limited devices. Fixed-precision +quantization suffers from performance drop due to the limited numerical +representation ability. Conversely, mixed-precision quantization (MPQ) is +advocated to compress the model effectively by allocating heterogeneous +bit-width for layers. MPQ is typically organized into a searching-retraining +two-stage process. Previous works only focus on determining the optimal +bit-width configuration in the first stage efficiently, while ignoring the +considerable time costs in the second stage. However, retraining always +consumes hundreds of GPU-hours on the cutting-edge GPUs, thus hindering +deployment efficiency significantly. In this paper, we devise a one-shot +training-searching paradigm for mixed-precision model compression. 
+Specifically, in the first stage, all potential bit-width configurations are +coupled and thus optimized simultaneously within a set of shared weights. +However, our observations reveal a previously unseen and severe bit-width +interference phenomenon among highly coupled weights during optimization, +leading to considerable performance degradation under a high compression ratio. +To tackle this problem, we first design a bit-width scheduler to dynamically +freeze the most turbulent bit-width of layers during training, to ensure the +rest bit-widths converged properly. Then, taking inspiration from information +theory, we present an information distortion mitigation technique to align the +behaviour of the bad-performing bit-widths to the well-performing ones.",cs.CV,['cs.CV'] +Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,Jinseok Kim · Tae-Kyun Kim, ,https://arxiv.org/abs/2403.10255,,2403.10255.pdf,Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,"Super-resolution (SR) and image generation are important tasks in computer +vision and are widely adopted in real-world applications. Most existing +methods, however, generate images only at fixed-scale magnification and suffer +from over-smoothing and artifacts. Additionally, they do not offer enough +diversity of output images nor image consistency at different scales. Most +relevant work applied Implicit Neural Representation (INR) to the denoising +diffusion model to obtain continuous-resolution yet diverse and high-quality SR +results. Since this model operates in the image space, the larger the +resolution of image is produced, the more memory and inference time is +required, and it also does not maintain scale-specific consistency. We propose +a novel pipeline that can super-resolve an input image or generate from a +random noise a novel image at arbitrary scales. The method consists of a +pretrained auto-encoder, a latent diffusion model, and an implicit neural +decoder, and their learning strategies. The proposed method adopts diffusion +processes in a latent space, thus efficient, yet aligned with output image +space decoded by MLPs at arbitrary scales. More specifically, our +arbitrary-scale decoder is designed by the symmetric decoder w/o up-scaling +from the pretrained auto-encoder, and Local Implicit Image Function (LIIF) in +series. The latent diffusion process is learnt by the denoising and the +alignment losses jointly. Errors in output images are backpropagated via the +fixed decoder, improving the quality of output images. In the extensive +experiments using multiple public benchmarks on the two tasks i.e. image +super-resolution and novel image generation at arbitrary scales, the proposed +method outperforms relevant methods in metrics of image quality, diversity and +scale consistency. 
It is significantly better than the relevant prior-art in +the inference speed and memory usage.",cs.CV,['cs.CV'] +Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation,Huyong Wang · Huisi Wu · Jing Qin, ,,https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-023-01121-3,,,,,nan +PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment,Tianchen Deng · Guole Shen · Tong Qin · jianyu wang · Wentao Zhao · Jingchuan Wang · Danwei Wang · Weidong Chen, ,https://arxiv.org/abs/2312.09866,,2312.09866.pdf,PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment,"Neural implicit scene representations have recently shown encouraging results +in dense visual SLAM. However, existing methods produce low-quality scene +reconstruction and low-accuracy localization performance when scaling up to +large indoor scenes and long sequences. These limitations are mainly due to +their single, global radiance field with finite capacity, which does not adapt +to large scenarios. Their end-to-end pose networks are also not robust enough +with the growth of cumulative errors in large scenes. To this end, we introduce +PLGSLAM, a neural visual SLAM system capable of high-fidelity surface +reconstruction and robust camera tracking in real-time. To handle large-scale +indoor scenes, PLGSLAM proposes a progressive scene representation method which +dynamically allocates new local scene representation trained with frames within +a local sliding window. This allows us to scale up to larger indoor scenes and +improves robustness (even under pose drifts). In local scene representation, +PLGSLAM utilizes tri-planes for local high-frequency features with multi-layer +perceptron (MLP) networks for the low-frequency feature, achieving smoothness +and scene completion in unobserved areas. Moreover, we propose local-to-global +bundle adjustment method with a global keyframe database to address the +increased pose drifts on long sequences. Experimental results demonstrate that +PLGSLAM achieves state-of-the-art scene reconstruction results and tracking +performance across various datasets and scenarios (both in small and +large-scale indoor environments).",cs.CV,['cs.CV'] +Bayesian Exploration of Pre-trained Models for Low-shot Image Classification,Yibo Miao · Yu lei · Feng Zhou · Zhijie Deng, ,https://arxiv.org/abs/2404.00312,,2404.00312.pdf,Bayesian Exploration of Pre-trained Models for Low-shot Image Classification,"Low-shot image classification is a fundamental task in computer vision, and +the emergence of large-scale vision-language models such as CLIP has greatly +advanced the forefront of research in this field. However, most existing +CLIP-based methods lack the flexibility to effectively incorporate other +pre-trained models that encompass knowledge distinct from CLIP. To bridge the +gap, this work proposes a simple and effective probabilistic model ensemble +framework based on Gaussian processes, which have previously demonstrated +remarkable efficacy in processing small data. We achieve the integration of +prior knowledge by specifying the mean function with CLIP and the kernel +function with an ensemble of deep kernels built upon various pre-trained +models. By regressing the classification label directly, our framework enables +analytical inference, straightforward uncertainty quantification, and +principled hyper-parameter tuning. 
Through extensive experiments on standard +benchmarks, we demonstrate that our method consistently outperforms competitive +ensemble baselines regarding predictive performance. Additionally, we assess +the robustness of our method and the quality of the yielded uncertainty +estimates on out-of-distribution datasets. We also illustrate that our method, +despite relying on label regression, still enjoys superior model calibration +compared to most deterministic baselines.",cs.CV,"['cs.CV', 'cs.AI']" +What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models,Letian Zhang · Xiaotong Zhai · Zhongkai Zhao · Yongshuo Zong · Xin Wen · Bingchen Zhao, ,https://arxiv.org/abs/2310.06627,,2310.06627.pdf,What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models,"Counterfactual reasoning, a fundamental aspect of human cognition, involves +contemplating alternatives to established facts or past events, significantly +enhancing our abilities in planning and decision-making. In light of the +advancements in current multi-modal large language models, we explore their +effectiveness in counterfactual reasoning. To facilitate this investigation, we +introduce a novel dataset, C-VQA, specifically designed to test the +counterfactual reasoning capabilities of modern multi-modal large language +models. This dataset is constructed by infusing original questions with +counterfactual presuppositions, spanning various types such as numerical and +boolean queries. It encompasses a mix of real and synthetic data, representing +a wide range of difficulty levels. Our thorough evaluations of contemporary +vision-language models using this dataset have revealed substantial performance +drops, with some models showing up to a 40% decrease, highlighting a +significant gap between current models and human-like vision reasoning +capabilities. We hope our dataset will serve as a vital benchmark for +evaluating the counterfactual reasoning capabilities of models. Code and +dataset are publicly available at https://bzhao.me/C-VQA/.",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG']" +Vision-and-Language Navigation via Causal Learning,Liuyi Wang · Zongtao He · Ronghao Dang · mengjiao shen · Chengju Liu · Qijun Chen, ,https://arxiv.org/abs/2404.10241,,2404.10241.pdf,Vision-and-Language Navigation via Causal Learning,"In the pursuit of robust and generalizable environment perception and +language understanding, the ubiquitous challenge of dataset bias continues to +plague vision-and-language navigation (VLN) agents, hindering their performance +in unseen environments. This paper introduces the generalized cross-modal +causal transformer (GOAT), a pioneering solution rooted in the paradigm of +causal inference. By delving into both observable and unobservable confounders +within vision, language, and history, we propose the back-door and front-door +adjustment causal learning (BACL and FACL) modules to promote unbiased learning +by comprehensively mitigating potential spurious correlations. Additionally, to +capture global confounder features, we propose a cross-modal feature pooling +(CFP) module supervised by contrastive learning, which is also shown to be +effective in improving cross-modal representations during pre-training. +Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and +SOON) underscore the superiority of our proposed method over previous +state-of-the-art approaches. 
Code is available at +https://github.com/CrystalSixone/VLN-GOAT.",cs.CV,"['cs.CV', 'cs.AI']" +TIM: A Time Interval Machine for Audio-Visual Action Recognition,Jacob Chalk · Jaesung Huh · Evangelos Kazakos · Andrew Zisserman · Dima Damen,https://jacobchalk.github.io/TIM-Project/,https://arxiv.org/abs/2404.05559,,2404.05559.pdf,TIM: A Time Interval Machine for Audio-Visual Action Recognition,"Diverse actions give rise to rich audio-visual signals in long videos. Recent +works showcase that the two modalities of audio and video exhibit different +temporal extents of events and distinct labels. We address the interplay +between the two modalities in long videos by explicitly modelling the temporal +extents of audio and visual events. We propose the Time Interval Machine (TIM) +where a modality-specific time interval poses as a query to a transformer +encoder that ingests a long video input. The encoder then attends to the +specified interval, as well as the surrounding context in both modalities, in +order to recognise the ongoing action. + We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, +Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On +EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly +larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we +show that TIM can be adapted for action detection, using dense multi-scale +interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and +showing strong performance on the Perception Test. Our ablations show the +critical role of integrating the two modalities and modelling their time +intervals in achieving this performance. Code and models at: +https://github.com/JacobChalk/TIM",cs.CV,['cs.CV'] +Retrieval-Augmented Open-Vocabulary Object Detection,Jooyeon Kim · Eulrang Cho · Sehyung Kim · Hyunwoo J. Kim, ,https://arxiv.org/abs/2404.05687,,2404.05687.pdf,Retrieval-Augmented Open-Vocabulary Object Detection,"Open-vocabulary object detection (OVD) has been studied with Vision-Language +Models (VLMs) to detect novel objects beyond the pre-trained categories. +Previous approaches improve the generalization ability to expand the knowledge +of the detector, using 'positive' pseudo-labels with additional 'class' names, +e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, +we propose Retrieval-Augmented Losses and visual Features (RALF). Our method +retrieves related 'negative' classes and augments loss functions. Also, visual +features are augmented with 'verbalized concepts' of classes, e.g., worn on the +feet, handheld music player, and sharp teeth. Specifically, RALF consists of +two modules: Retrieval Augmented Losses (RAL) and Retrieval-Augmented visual +Features (RAF). RAL constitutes two losses reflecting the semantic similarity +with negative vocabularies. In addition, RAF augments visual features with the +verbalized concepts from a large language model (LLM). Our experiments +demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We +achieve improvement up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of +the COCO dataset and 3.6 mask AP$_{\text{r}}$ gains on the LVIS dataset. 
Code +is available at https://github.com/mlvlab/RALF .",cs.CV,['cs.CV'] +Continual Motion Prediction Learning Framework via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy,Dae Jun Kang · Dongsuk Kum · Sanmin Kim, ,https://arxiv.org/html/2311.11908v3,,2311.11908v3.pdf,Continual Learning: Applications and the Road Forward,"Continual learning is a subfield of machine learning, which aims to allow +machine learning models to continuously learn on new data, by accumulating +knowledge without forgetting what was learned in the past. In this work, we +take a step back, and ask: ""Why should one care about continual learning in the +first place?"". We set the stage by examining recent continual learning papers +published at four major machine learning conferences, and show that +memory-constrained settings dominate the field. Then, we discuss five open +problems in machine learning, and even though they might seem unrelated to +continual learning at first sight, we show that continual learning will +inevitably be part of their solution. These problems are model editing, +personalization and specialization, on-device learning, faster (re-)training +and reinforcement learning. Finally, by comparing the desiderata from these +unsolved problems and the current assumptions in continual learning, we +highlight and discuss four future directions for continual learning research. +We hope that this work offers an interesting perspective on the future of +continual learning, while displaying its potential value and the paths we have +to pursue in order to make it successful. This work is the result of the many +discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, +in March 2023.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Learning Visual Prompt for Gait Recognition,Kang Ma · Ying Fu · Chunshui Cao · Saihui Hou · Yongzhen Huang · Dezhi Zheng, ,https://arxiv.org/abs/2402.19122,,2402.19122.pdf,BigGait: Learning Gait Representation You Want by Large Vision Models,"Gait recognition stands as one of the most pivotal remote identification +technologies and progressively expands across research and industry +communities. However, existing gait recognition methods heavily rely on +task-specific upstream driven by supervised learning to provide explicit gait +representations like silhouette sequences, which inevitably introduce expensive +annotation costs and potential error accumulation. Escaping from this trend, +this work explores effective gait representations based on the all-purpose +knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a +simple yet efficient gait framework, termed BigGait. Specifically, the Gait +Representation Extractor (GRE) within BigGait draws upon design principles from +established gait representations, effectively transforming all-purpose +knowledge into implicit gait representations without requiring third-party +supervision signals. Experiments on CCPG, CAISA-B* and SUSTech1K indicate that +BigGait significantly outperforms the previous methods in both within-domain +and cross-domain tasks in most cases, and provides a more practical paradigm +for learning the next-generation gait representation. Finally, we delve into +prospective challenges and promising directions in LVMs-based gait recognition, +aiming to inspire future work in this emerging topic. 
The source code is +available at https://github.com/ShiqiYu/OpenGait.",cs.CV,['cs.CV'] +Zero-Reference Low-Light Enhancement via Physical Quadruple Priors,Wenjing Wang · Huan Yang · Jianlong Fu · Jiaying Liu,https://daooshee.github.io/QuadPrior-Website/,https://arxiv.org/abs/2403.12933,,2403.12933.pdf,Zero-Reference Low-Light Enhancement via Physical Quadruple Priors,"Understanding illumination and reducing the need for supervision pose a +significant challenge in low-light enhancement. Current approaches are highly +sensitive to data usage during training and illumination-specific +hyper-parameters, limiting their ability to handle unseen scenarios. In this +paper, we propose a new zero-reference low-light enhancement framework +trainable solely with normal light images. To accomplish this, we devise an +illumination-invariant prior inspired by the theory of physical light transfer. +This prior serves as the bridge between normal and low-light images. Then, we +develop a prior-to-image framework trained without low-light data. During +testing, this framework is able to restore our illumination-invariant prior +back to images, automatically achieving low-light enhancement. Within this +framework, we leverage a pretrained generative diffusion model for model +ability, introduce a bypass decoder to handle detail distortion, as well as +offer a lightweight version for practicality. Extensive experiments demonstrate +our framework's superiority in various scenarios as well as good +interpretability, robustness, and efficiency. Code is available on our project +homepage: http://daooshee.github.io/QuadPrior-Website/",cs.CV,['cs.CV'] +Differentiable Information Bottleneck for Deterministic Multi-view Clustering,Xiaoqiang Yan · Zhixiang Jin · Fengshou Han · Yangdong Ye, ,https://arxiv.org/abs/2403.15681,,2403.15681.pdf,Differentiable Information Bottleneck for Deterministic Multi-view Clustering,"In recent several years, the information bottleneck (IB) principle provides +an information-theoretic framework for deep multi-view clustering (MVC) by +compressing multi-view observations while preserving the relevant information +of multiple views. Although existing IB-based deep MVC methods have achieved +huge success, they rely on variational approximation and distribution +assumption to estimate the lower bound of mutual information, which is a +notoriously hard and impractical problem in high-dimensional multi-view spaces. +In this work, we propose a new differentiable information bottleneck (DIB) +method, which provides a deterministic and analytical MVC solution by fitting +the mutual information without the necessity of variational approximation. +Specifically, we first propose to directly fit the mutual information of +high-dimensional spaces by leveraging normalized kernel Gram matrix, which does +not require any auxiliary neural estimator to estimate the lower bound of +mutual information. Then, based on the new mutual information measurement, a +deterministic multi-view neural network with analytical gradients is explicitly +trained to parameterize IB principle, which derives a deterministic compression +of input variables from different views. Finally, a triplet consistency +discovery mechanism is devised, which is capable of mining the feature +consistency, cluster consistency and joint consistency based on the +deterministic and compact representations. 
Extensive experimental results show +the superiority of our DIB method on 6 benchmarks compared with 13 +state-of-the-art baselines.",cs.IT,"['cs.IT', 'cs.LG', 'math.IT']" +Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction,Zilin Du · Haoxin Li · Xu Guo · Boyang Li, ,https://arxiv.org/abs/2312.03025,,2312.03025.pdf,Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction,"The task of multimodal relation extraction has attracted significant research +attention, but progress is constrained by the scarcity of available training +data. One natural thought is to extend existing datasets with cross-modal +generative models. In this paper, we consider a novel problem setting, where +only unimodal data, either text or image, are available during training. We aim +to train a multimodal classifier from synthetic data that perform well on real +multimodal test data. However, training with synthetic data suffers from two +obstacles: lack of data diversity and label information loss. To alleviate the +issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta +GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to +promote diversity in the generated data and exploits a teacher network to +select valuable training samples with high mutual information with the +ground-truth labels. Comparing our method to direct training on synthetic data, +we observed a significant improvement of 24.06% F1 with synthetic text and +26.42% F1 with synthetic images. Notably, our best model trained on completely +synthetic images outperforms prior state-of-the-art models trained on real +multimodal data by a margin of 3.76% in F1. Our codebase will be made available +upon acceptance.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG']" +DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data,Hanrong Ye · Dan Xu, ,https://arxiv.org/abs/2403.15389,,2403.15389.pdf,DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data,"Recently, there has been an increased interest in the practical problem of +learning multiple dense scene understanding tasks from partially annotated +data, where each training sample is only labeled for a subset of the tasks. The +missing of task labels in training leads to low-quality and noisy predictions, +as can be observed from state-of-the-art methods. To tackle this issue, we +reformulate the partially-labeled multi-task dense prediction as a pixel-level +denoising problem, and propose a novel multi-task denoising diffusion framework +coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to +model a potential noisy distribution in the task prediction or feature maps and +generate rectified outputs for different tasks. To exploit multi-task +consistency in denoising, we further introduce a Multi-Task Conditioning +strategy, which can implicitly utilize the complementary nature of the tasks to +help learn the unlabeled tasks, leading to an improvement in the denoising +performance of the different tasks. Extensive quantitative and qualitative +experiments demonstrate that the proposed multi-task denoising diffusion model +can significantly improve multi-task prediction maps, and outperform the +state-of-the-art methods on three challenging multi-task benchmarks, under two +different partial-labeling evaluation settings. 
The code is available at +https://prismformore.github.io/diffusionmtl/.",cs.CV,"['cs.CV', 'cs.LG']" +Retrieval-Augmented Embodied Agents,Yichen Zhu · Zhicai Ou · Xiaofeng Mou · Jian Tang, ,https://arxiv.org/abs/2404.11699,,2404.11699.pdf,Retrieval-Augmented Embodied Agents,"Embodied agents operating in complex and uncertain environments face +considerable challenges. While some advanced agents handle complex manipulation +tasks with proficiency, their success often hinges on extensive training data +to develop their capabilities. In contrast, humans typically rely on recalling +past experiences and analogous situations to solve new problems. Aiming to +emulate this human approach in robotics, we introduce the Retrieval-Augmented +Embodied Agent (RAEA). This innovative system equips robots with a form of +shared memory, significantly enhancing their performance. Our approach +integrates a policy retriever, allowing robots to access relevant strategies +from an external policy memory bank based on multi-modal inputs. Additionally, +a policy generator is employed to assimilate these strategies into the learning +process, enabling robots to formulate effective responses to tasks. Extensive +testing of RAEA in both simulated and real-world scenarios demonstrates its +superior performance over traditional methods, representing a major leap +forward in robotic technology.",cs.RO,['cs.RO'] +Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models,Shengqu Cai · Duygu Ceylan · Matheus Gadelha · Chun-Hao P. Huang · Tuanfeng Y. Wang · Gordon Wetzstein,https://primecai.github.io/generative_rendering/,https://arxiv.org/abs/2312.01409,,2312.01409.pdf,Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models,"Traditional 3D content creation tools empower users to bring their +imagination to life by giving them direct control over a scene's geometry, +appearance, motion, and camera path. Creating computer-generated videos, +however, is a tedious manual process, which can be automated by emerging +text-to-video diffusion models. Despite great promise, video diffusion models +are difficult to control, hindering a user to apply their own creativity rather +than amplifying it. To address this challenge, we present a novel approach that +combines the controllability of dynamic 3D meshes with the expressivity and +editability of emerging diffusion models. For this purpose, our approach takes +an animated, low-fidelity rendered mesh as input and injects the ground truth +correspondence information obtained from the dynamic mesh into various stages +of a pre-trained text-to-image generation model to output high-quality and +temporally consistent frames. 
We demonstrate our approach on various examples +where motion can be obtained by animating rigged assets or changing the camera +path.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Not All Classes Stand on Same Embeddings: Calibrating a Semantic Distance with Metric Tensor,Jae Hyeon Park · Gyoomin Lee · Seunggi Park · Sung In Cho, ,,https://stackoverflow.com/questions/76678783/langchains-chroma-vectordb-similarity-search-with-score-and-vectordb-simil,,,,,nan +OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation,Qidong Huang · Xiaoyi Dong · Pan Zhang · Bin Wang · Conghui He · Jiaqi Wang · Dahua Lin · Weiming Zhang · Nenghai Yu, ,https://arxiv.org/abs/2311.17911,,2311.17911.pdf,OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation,"Hallucination, posed as a pervasive challenge of multi-modal large language +models (MLLMs), has significantly impeded their real-world usage that demands +precise judgment. Existing methods mitigate this issue with either training +with specific designed data or inferencing with external knowledge from other +sources, incurring inevitable additional costs. In this paper, we present +OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a +Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate +the hallucination issue without additional data, knowledge, or training. Our +approach begins with an interesting observation that, most hallucinations are +closely tied to the knowledge aggregation patterns manifested in the +self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a +few summary tokens, but not all the previous tokens. Such partial over-trust +inclination results in the neglecting of image tokens and describes the image +content with hallucination. Based on the observation, OPERA introduces a +penalty term on the model logits during the beam-search decoding to mitigate +the over-trust issue, along with a rollback strategy that retrospects the +presence of summary tokens in the previously generated tokens, and re-allocate +the token selection if necessary. With extensive experiments, OPERA shows +significant hallucination-mitigating performance on different MLLMs and +metrics, proving its effectiveness and generality. Our code is available at: +https://github.com/shikiw/OPERA.",cs.CV,['cs.CV'] +Combining Frame and GOP Embeddings for Neural Video Representation,Jens Eirik Saethre · Roberto Azevedo · Christopher Schroers, ,https://arxiv.org/abs/2403.15679,,2403.15679.pdf,DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,"Implicit neural representations for video (NeRV) have recently become a novel +way for high-quality video representation. However, existing works employ a +single network to represent the entire video, which implicitly confuse static +and dynamic information. This leads to an inability to effectively compress the +redundant static information and lack the explicitly modeling of global +temporal-coherent dynamic details. To solve above problems, we propose DS-NeRV, +which decomposes videos into sparse learnable static codes and dynamic codes +without the need for explicit optical flow or residual supervision. 
By setting +different sampling rates for two codes and applying weighted sum and +interpolation sampling methods, DS-NeRV efficiently utilizes redundant static +information while maintaining high-frequency details. Additionally, we design a +cross-channel attention-based (CCA) fusion module to efficiently fuse these two +codes for frame decoding. Our approach achieves a high quality reconstruction +of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic +codes representation and outperforms existing NeRV methods in many downstream +tasks. Our project website is at https://haoyan14.github.io/DS-NeRV.",cs.CV,"['cs.CV', 'cs.MM']" +FlowIE:Efficient Image Enhancement via Rectified Flow,Yixuan Zhu · Wenliang Zhao · Ao Li · Yansong Tang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2405.14677,,2405.14677.pdf,RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance,"Customizing diffusion models to generate identity-preserving images from +user-provided reference images is an intriguing new problem. The prevalent +approaches typically require training on extensive domain-specific images to +achieve identity preservation, which lacks flexibility across different use +cases. To address this issue, we exploit classifier guidance, a training-free +technique that steers diffusion models using an existing classifier, for +personalized image generation. Our study shows that based on a recent rectified +flow framework, the major limitation of vanilla classifier guidance in +requiring a special classifier can be resolved with a simple fixed-point +solution, allowing flexible personalization with off-the-shelf image +discriminators. Moreover, its solving procedure proves to be stable when +anchored to a reference flow trajectory, with a convergence guarantee. The +derived method is implemented on rectified flow with different off-the-shelf +image discriminators, delivering advantageous personalization results for human +faces, live subjects, and certain objects. Code is available at +https://github.com/feifeiobama/RectifID.",cs.CV,"['cs.CV', 'cs.LG']" +CPR-Coach: Recognizing Composite Error Actions based on Single-class Training,Shunli Wang · Shuaibing Wang · Dingkang Yang · Mingcheng Li · Haopeng Kuang · Xiao Zhao · Liuzhen Su · Peng Zhai · Lihua Zhang, ,https://arxiv.org/abs/2309.11718,,2309.11718.pdf,CPR-Coach: Recognizing Composite Error Actions based on Single-class Training,"The fine-grained medical action analysis task has received considerable +attention from pattern recognition communities recently, but it faces the +problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is +an essential skill in emergency treatment. Currently, the assessment of CPR +skills mainly depends on dummies and trainers, leading to high training costs +and low efficiency. For the first time, this paper constructs a vision-based +system to complete error action recognition and skill assessment in CPR. +Specifically, we define 13 types of single-error actions and 74 types of +composite error actions during external cardiac compression and then develop a +video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this +paper thoroughly investigates and compares the performance of existing action +recognition models based on different data modalities. 
To solve the unavoidable +Single-class Training & Multi-class Testing problem, we propose a +human-cognition-inspired framework named ImagineNet to improve the model's +multi-error recognition performance under restricted supervision. Extensive +experiments verify the effectiveness of the framework. We hope this work could +advance research toward fine-grained medical action analysis and skill +assessment. The CPR-Coach dataset and the code of ImagineNet are publicly +available on Github.",cs.CV,"['cs.CV', 'I.5.4']" +Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion,Zixian Gao · Xun Jiang · Xing Xu · Fumin Shen · Yujie Li · Heng Tao Shen, ,https://arxiv.org/abs/2307.16121,,2307.16121.pdf,Uncertainty-Encoded Multi-Modal Fusion for Robust Object Detection in Autonomous Driving,"Multi-modal fusion has shown initial promising results for object detection +of autonomous driving perception. However, many existing fusion schemes do not +consider the quality of each fusion input and may suffer from adverse +conditions on one or more sensors. While predictive uncertainty has been +applied to characterize single-modal object detection performance at run time, +incorporating uncertainties into the multi-modal fusion still lacks effective +solutions due primarily to the uncertainty's cross-modal incomparability and +distinct sensitivities to various adverse conditions. To fill this gap, this +paper proposes Uncertainty-Encoded Mixture-of-Experts (UMoE) that explicitly +incorporates single-modal uncertainties into LiDAR-camera fusion. UMoE uses +individual expert network to process each sensor's detection result together +with encoded uncertainty. Then, the expert networks' outputs are analyzed by a +gating network to determine the fusion weights. The proposed UMoE module can be +integrated into any proposal fusion pipeline. Evaluation shows that UMoE +achieves a maximum of 10.67%, 3.17%, and 5.40% performance gain compared with +the state-of-the-art proposal-level multi-modal object detectors under extreme +weather, adversarial, and blinding attack scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Unsupervised Deep Unrolling Networks for Phase Unwrapping,Zhile Chen · Yuhui Quan · Hui Ji, ,,https://ieeexplore.ieee.org/document/10520881,,,,,nan +Transductive Zero-Shot $\&$ Few-Shot CLIP,Ségolène Martin · Yunshi HUANG · Fereshteh Shakeri · Jean-Christophe Pesquet · Ismail Ben Ayed, ,https://arxiv.org/abs/2405.18437,,2405.18437.pdf,Transductive Zero-Shot and Few-Shot CLIP,"Transductive inference has been widely investigated in few-shot image +classification, but completely overlooked in the recent, fast growing +literature on adapting vision-language models like CLIP. This paper addresses +the transductive zero-shot and few-shot CLIP classification challenge, in which +inference is performed jointly across a mini-batch of unlabeled query samples, +rather than treating each instance independently. We initially construct +informative vision-text probability features, leading to a classification +problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our +optimization-based classification objective models the data probability +distribution for each class using a Dirichlet law. The minimization problem is +then tackled with a novel block Majorization-Minimization algorithm, which +simultaneously estimates the distribution parameters and class assignments.
+Extensive numerical experiments on 11 datasets underscore the benefits and +efficacy of our batch inference approach. On zero-shot tasks with test batches +of 75 samples, our approach yields near 20% improvement in ImageNet accuracy +over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art +methods in the few-shot setting. The code is available at: +https://github.com/SegoleneMartin/transductive-CLIP.",cs.CV,"['cs.CV', 'cs.AI']" +Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living,Dominick Reilly · Srijan Das, ,https://arxiv.org/abs/2311.18840,,2311.18840.pdf,Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living,"Video transformers have become the de facto standard for human action +recognition, yet their exclusive reliance on the RGB modality still limits +their adoption in certain domains. One such domain is Activities of Daily +Living (ADL), where RGB alone is not sufficient to distinguish between visually +similar actions, or actions observed from multiple viewpoints. To facilitate +the adoption of video transformers for ADL, we hypothesize that the +augmentation of RGB with human pose information, known for its sensitivity to +fine-grained motion and multiple viewpoints, is essential. Consequently, we +introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a +novel approach that augments the RGB representations learned by video +transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are +two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction +Module, that are responsible for inducing 2D and 3D pose information into the +RGB representations. These modules operate by performing pose-aware auxiliary +tasks, a design choice that allows $\pi$-ViT to discard the modules during +inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on +three prominent ADL datasets, encompassing both real-world and large-scale +RGB-D datasets, without requiring poses or additional computational overhead at +inference.",cs.CV,['cs.CV'] +Specularity Factorization for Low Light Enhancement,Saurabh Saini · P. J. Narayanan, ,https://arxiv.org/abs/2404.01998,,2404.01998.pdf,Specularity Factorization for Low-Light Enhancement,"We present a new additive image factorization technique that treats images to +be composed of multiple latent specular components which can be simply +estimated recursively by modulating the sparsity during decomposition. Our +model-driven {\em RSFNet} estimates these factors by unrolling the optimization +into network layers requiring only a few scalars to be learned. The resultant +factors are interpretable by design and can be fused for different image +enhancement tasks via a network or combined directly by the user in a +controllable fashion. Based on RSFNet, we detail a zero-reference Low Light +Enhancement (LLE) application trained without paired or unpaired supervision. +Our system improves the state-of-the-art performance on standard benchmarks and +achieves better generalization on multiple other datasets. We also integrate +our factors with other task specific fusion networks for applications like +deraining, deblurring and dehazing with negligible overhead thereby +highlighting the multi-domain and multi-task generalizability of our proposed +RSFNet.
The code and data is released for reproducibility on the project +homepage.",cs.CV,"['cs.CV', 'cs.LG']" +Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion.,Weijian Ma · Shuaiqi Chen · Yunzhong Lou · Xueyang Li · Xiangdong Zhou, ,https://arxiv.org/abs/2405.15188,,2405.15188.pdf,PS-CAD: Local Geometry Guidance via Prompting and Selection for CAD Reconstruction,"Reverse engineering CAD models from raw geometry is a classic but challenging +research problem. In particular, reconstructing the CAD modeling sequence from +point clouds provides great interpretability and convenience for editing. To +improve upon this problem, we introduce geometric guidance into the +reconstruction network. Our proposed model, PS-CAD, reconstructs the CAD +modeling sequence one step at a time. At each step, we provide two forms of +geometric guidance. First, we provide the geometry of surfaces where the +current reconstruction differs from the complete model as a point cloud. This +helps the framework to focus on regions that still need work. Second, we use +geometric analysis to extract a set of planar prompts, that correspond to +candidate surfaces where a CAD extrusion step could be started. Our framework +has three major components. Geometric guidance computation extracts the two +types of geometric guidance. Single-step reconstruction computes a single +candidate CAD modeling step for each provided prompt. Single-step selection +selects among the candidate CAD modeling steps. The process continues until the +reconstruction is completed. Our quantitative results show a significant +improvement across all metrics. For example, on the dataset DeepCAD, PS-CAD +improves upon the best published SOTA method by reducing the geometry errors +(CD and HD) by 10%, and the structural error (ECD metric) by about 15%.",cs.CV,['cs.CV'] +Logarithmic Lenses: Exploring Log RGB Data for Image Classification,Bruce Maxwell · Bruce Maxwell · Sumegha Singhania · Avnish Patel · Rahul Kumar · Heather Fryling · Sihan Li · Haonan Sun · Ping He · Zewen Li, ,,https://medium.com/@adjileyeb/unlocking-visual-insights-applying-the-logit-lens-to-image-data-with-vision-transformers-b99cb70dd704,,,,,nan +D$^4$M: Dataset Distillation via Disentangled Diffusion Model,Duo Su · Junjie Hou · Weizhi Gao · Yingjie Tian · Bowen Tang, ,https://arxiv.org/abs/2403.03881,,2403.03881.pdf,Latent Dataset Distillation with Diffusion Models,"The efficacy of machine learning has traditionally relied on the availability +of increasingly larger datasets. However, large datasets pose storage +challenges and contain non-influential samples, which could be ignored during +training without impacting the final accuracy of the model. In response to +these limitations, the concept of distilling the information on a dataset into +a condensed set of (synthetic) samples, namely a distilled dataset, emerged. +One crucial aspect is the selected architecture (usually ConvNet) for linking +the original and synthetic datasets. However, the final accuracy is lower if +the employed model architecture differs from the model used during +distillation. Another challenge is the generation of high-resolution images, +e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation +with Diffusion Models (LD3M) that combine diffusion in latent space with +dataset distillation to tackle both challenges. 
LD3M incorporates a novel +diffusion process tailored for dataset distillation, which improves the +gradient norms for learning synthetic images. By adjusting the number of +diffusion steps, LD3M also offers a straightforward way of controlling the +trade-off between speed and accuracy. We evaluate our approach in several +ImageNet subsets and for high-resolution images (128x128 and 256x256). As a +result, LD3M consistently outperforms state-of-the-art distillation techniques +by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Probing the 3D Awareness of Visual Foundation Models,Mohamed El Banani · Amit Raj · Kevis-kokitsi Maninis · Abhishek Kar · Yuanzhen Li · Michael Rubinstein · Deqing Sun · Leonidas Guibas · Justin Johnson · Varun Jampani, ,https://arxiv.org/abs/2404.08636,,2404.08636.pdf,Probing the 3D Awareness of Visual Foundation Models,"Recent advances in large-scale pretraining have yielded visual foundation +models with strong capabilities. Not only can recent models generalize to +arbitrary images for their training task, their intermediate representations +are useful for other visual tasks such as detection and segmentation. Given +that such models can classify, delineate, and localize objects in 2D, we ask +whether they also represent their 3D structure? In this work, we analyze the 3D +awareness of visual foundation models. We posit that 3D awareness implies that +representations (1) encode the 3D structure of the scene and (2) consistently +represent the surface across views. We conduct a series of experiments using +task-specific probes and zero-shot inference procedures on frozen features. Our +experiments reveal several limitations of the current models. Our code and +analysis can be found at https://github.com/mbanani/probe3d.",cs.CV,['cs.CV'] +HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models,Nataniel Ruiz · Yuanzhen Li · Varun Jampani · Wei Wei · Tingbo Hou · Yael Pritch · Neal Wadhwa · Michael Rubinstein · Kfir Aberman, ,https://arxiv.org/abs/2307.06949,,2307.06949.pdf,HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models,"Personalization has emerged as a prominent aspect within the field of +generative AI, enabling the synthesis of individuals in diverse contexts and +styles, while retaining high-fidelity to their identities. However, the process +of personalization presents inherent challenges in terms of time and memory +requirements. Fine-tuning each personalized model needs considerable GPU time +investment, and storing a personalized model per subject can be demanding in +terms of storage capacity. To overcome these challenges, we propose +HyperDreamBooth-a hypernetwork capable of efficiently generating a small set of +personalized weights from a single image of a person. By composing these +weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth +can generate a person's face in various contexts and styles, with high subject +details while also preserving the model's crucial knowledge of diverse styles +and semantic modifications. Our method achieves personalization on faces in +roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual +Inversion, using as few as one reference image, with the same quality and style +diversity as DreamBooth. Also our method yields a model that is 10000x smaller +than a normal DreamBooth model. 
Project page: https://hyperdreambooth.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams,Liao Wang · Kaixin Yao · Chengcheng Guo · Zhirui Zhang · Qiang Hu · Jingyi Yu · Lan Xu · Minye Wu, ,https://arxiv.org/abs/2312.01407,,2312.01407.pdf,VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams,"Neural Radiance Fields (NeRFs) excel in photorealistically rendering static +scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous +devices remains challenging, due to data storage and computational constraints. +In this paper, we introduce VideoRF, the first approach to enable real-time +streaming and rendering of dynamic radiance fields on mobile platforms. At the +core is a serialized 2D feature image stream representing the 4D radiance field +all in one. We introduce a tailored training scheme directly applied to this 2D +domain to impose the temporal and spatial redundancy of the feature image +stream. By leveraging the redundancy, we show that the feature image stream can +be efficiently compressed by 2D video codecs, which allows us to exploit video +hardware accelerators to achieve real-time decoding. On the other hand, based +on the feature image stream, we propose a novel rendering pipeline for VideoRF, +which has specialized space mappings to query radiance properties efficiently. +Paired with a deferred shading model, VideoRF has the capability of real-time +rendering on mobile devices thanks to its efficiency. We have developed a +real-time interactive player that enables online streaming and rendering of +dynamic scenes, offering a seamless and immersive free-viewpoint experience +across a range of devices, from desktops to mobile phones.",cs.CV,['cs.CV'] +GLaMM: Pixel Grounding Large Multimodal Model,Hanoona Rasheed · Muhammad Maaz · Sahal Shaji Mullappilly · Abdelrahman Shaker · Salman Khan · Hisham Cholakkal · Rao Anwer · Eric P. Xing · Ming-Hsuan Yang · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2311.03356v1,,2311.03356v1.pdf,GLaMM: Pixel Grounding Large Multimodal Model,"Large Multimodal Models (LMMs) extend Large Language Models to the vision +domain. Initial efforts towards LMMs used holistic images and text prompts to +generate ungrounded textual responses. Very recently, region-level LMMs have +been used to generate visually grounded responses. However, they are limited to +only referring a single object category at a time, require users to specify the +regions in inputs, or cannot offer dense pixel-wise object grounding. In this +work, we present Grounding LMM (GLaMM), the first model that can generate +natural language responses seamlessly intertwined with corresponding object +segmentation masks. GLaMM not only grounds objects appearing in the +conversations but is flexible enough to accept both textual and optional visual +prompts (region of interest) as input. This empowers users to interact with the +model at various levels of granularity, both in textual and visual domains. Due +to the lack of standard benchmarks for the novel setting of generating visually +grounded detailed conversations, we introduce a comprehensive evaluation +protocol with our curated grounded conversations. Our proposed Grounded +Conversation Generation (GCG) task requires densely grounded concepts in +natural scenes at a large-scale. 
To this end, we propose a densely annotated +Grounding-anything Dataset (GranD) using our proposed automated annotation +pipeline that encompasses 7.5M unique concepts grounded in a total of 810M +regions available with segmentation masks. Besides GCG, GLaMM also performs +effectively on several downstream tasks e.g., referring expression +segmentation, image and region-level captioning and vision-language +conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.",cs.CV,"['cs.CV', 'cs.AI']" +pix2gestalt: Amodal Segmentation by Synthesizing Wholes,Ege Ozguroglu · Ruoshi Liu · Dídac Surís · Dian Chen · Achal Dave · Pavel Tokmakov · Carl Vondrick, ,https://arxiv.org/abs/2401.14398,,2401.14398.pdf,pix2gestalt: Amodal Segmentation by Synthesizing Wholes,"We introduce pix2gestalt, a framework for zero-shot amodal segmentation, +which learns to estimate the shape and appearance of whole objects that are +only partially visible behind occlusions. By capitalizing on large-scale +diffusion models and transferring their representations to this task, we learn +a conditional diffusion model for reconstructing whole objects in challenging +zero-shot cases, including examples that break natural and physical priors, +such as art. As training data, we use a synthetically curated dataset +containing occluded objects paired with their whole counterparts. Experiments +show that our approach outperforms supervised baselines on established +benchmarks. Our model can furthermore be used to significantly improve the +performance of existing object recognition and 3D reconstruction methods in the +presence of occlusions.",cs.CV,"['cs.CV', 'cs.LG']" +LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation,Xuecan Wang · Shibang Xiao · Xiaohui Liang, ,https://arxiv.org/abs/2404.03925,,2404.03925.pdf,LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation,"We present a lightweight solution for estimating spatially-coherent indoor +lighting from a single RGB image. Previous methods for estimating illumination +using volumetric representations have overlooked the sparse distribution of +light sources in space, necessitating substantial memory and computational +resources for achieving high-quality results. We introduce a unified, voxel +octree-based illumination estimation framework to produce 3D spatially-coherent +lighting. Additionally, a differentiable voxel octree cone tracing rendering +layer is proposed to eliminate regular volumetric representation throughout the +entire process and ensure the retention of features across different frequency +domains. This reduction significantly decreases spatial usage and required +floating-point operations without substantially compromising precision. +Experimental results demonstrate that our approach achieves high-quality +coherent estimation with minimal cost compared to previous methods.",cs.CV,['cs.CV'] +3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis,Zhicheng Lu · xiang guo · Le Hui · Tianrui Chen · Min Yang · Xiao Tang · feng zhu · Yuchao Dai, ,https://arxiv.org/abs/2404.06270,,2404.06270.pdf,3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis,"In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting +method for dynamic view synthesis. Existing neural radiance fields (NeRF) based +solutions learn the deformation in an implicit manner, which cannot incorporate +3D scene geometry. 
Therefore, the learned deformation is not necessarily +geometrically coherent, which results in unsatisfactory dynamic view synthesis +and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new +representation of the 3D scene, building upon which the 3D geometry could be +exploited in learning the complex 3D deformation. Specifically, the scenes are +represented as a collection of 3D Gaussian, where each 3D Gaussian is optimized +to move and rotate over time to model the deformation. To enforce the 3D scene +geometry constraint during deformation, we explicitly extract 3D geometry +features and integrate them in learning the 3D deformation. In this way, our +solution achieves 3D geometry-aware deformation modeling, which enables +improved dynamic view synthesis and 3D dynamic reconstruction. Extensive +experimental results on both synthetic and real datasets prove the superiority +of our solution, which achieves new state-of-the-art performance. + The project is available at https://npucvr.github.io/GaGS/",cs.CV,['cs.CV'] +PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection,Qihang Ma · Zhizhong Zhang · Xin Tan · Yanyun Qu · Chengwei Chen · Yuan Xie · Lizhuang Ma, ,https://arxiv.org/abs/2404.05231,,2404.05231.pdf,PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection,"The vision-language model has brought great improvement to few-shot +industrial anomaly detection, which usually needs to design of hundreds of +prompts through prompt engineering. For automated scenarios, we first use +conventional prompt learning with many-class paradigm as the baseline to +automatically learn prompts but found that it can not work well in one-class +anomaly detection. To address the above problem, this paper proposes a +one-class prompt learning method for few-shot anomaly detection, termed +PromptAD. First, we propose semantic concatenation which can transpose normal +prompts into anomaly prompts by concatenating normal prompts with anomaly +suffixes, thus constructing a large number of negative samples used to guide +prompt learning in one-class setting. Furthermore, to mitigate the training +challenge caused by the absence of anomaly images, we introduce the concept of +explicit anomaly margin, which is used to explicitly control the margin between +normal prompt features and anomaly prompt features through a hyper-parameter. +For image-level/pixel-level anomaly detection, PromptAD achieves first place in +11/12 few-shot settings on MVTec and VisA.",cs.CV,['cs.CV'] +Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation,Dong Lao · Congli Wang · Alex Wong · Stefano Soatto, ,,https://www.semanticscholar.org/paper/Diffeomorphic-Template-Registration-for-Atmospheric-Lao-Wang/d03a9da146a21840a76c6a42b1a1572736fe5a14/figure/2,,,,,nan +From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers,Swaminathan Gurumurthy · Karnik Ram · Bingqing Chen · Zachary Manchester · Zico Kolter, ,https://arxiv.org/abs/2307.08873,,2307.08873.pdf,An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient,"Restricting the variance of a policy's return is a popular choice in +risk-averse Reinforcement Learning (RL) due to its clear mathematical +definition and easy interpretability. Traditional methods directly restrict the +total return variance. Recent methods restrict the per-step reward variance as +a proxy. 
We thoroughly examine the limitations of these variance-based methods, +such as sensitivity to numerical scale and hindering of policy learning, and +propose to use an alternative risk measure, Gini deviation, as a substitute. We +study various properties of this new risk measure and derive a policy gradient +algorithm to minimize it. Empirical evaluation in domains where risk-aversion +can be clearly defined, shows that our algorithm can mitigate the limitations +of variance-based risk measures and achieves high return with low risk in terms +of variance and Gini deviation when others fail to learn a reasonable policy.",cs.LG,"['cs.LG', 'cs.AI']" +HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D,Sangmin Woo · byeongjun park · Hyojun Go · Jin-Young Kim · Changick Kim, ,,https://github.com/byeongjun-park/HarmonyView,,,,,nan +Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform,Chunghyun Park · Seungwook Kim · Jaesik Park · Minsu Cho, ,https://arxiv.org/abs/2404.11156,,2404.11156.pdf,Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform,"Establishing accurate 3D correspondences between shapes stands as a pivotal +challenge with profound implications for computer vision and robotics. However, +existing self-supervised methods for this problem assume perfect input shape +alignment, restricting their real-world applicability. In this work, we +introduce a novel self-supervised Rotation-Invariant 3D correspondence learner +with Local Shape Transform, dubbed RIST, that learns to establish dense +correspondences between shapes even under challenging intra-class variations +and arbitrary orientations. Specifically, RIST learns to dynamically formulate +an SO(3)-invariant local shape transform for each point, which maps the +SO(3)-equivariant global shape descriptor of the input shape to a local shape +descriptor. These local shape descriptors are provided as inputs to our decoder +to facilitate point cloud self- and cross-reconstruction. Our proposed +self-supervised training pipeline encourages semantically corresponding points +from different shapes to be mapped to similar local shape descriptors, enabling +RIST to establish dense point-wise correspondences. RIST demonstrates +state-of-the-art performances on 3D part label transfer and semantic keypoint +transfer given arbitrarily rotated point cloud pairs, outperforming existing +methods by significant margins.",cs.CV,['cs.CV'] +BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation,Jiahao Lu · Jiacheng Deng · Tianzhu Zhang, ,https://arxiv.org/abs/2403.15019,,2403.15019.pdf,BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation,"3D instance segmentation (3DIS) is a crucial task, but point-level +annotations are tedious in fully supervised settings. Thus, using bounding +boxes (bboxes) as annotations has shown great potential. The current mainstream +approach is a two-step process, involving the generation of pseudo-labels from +box annotations and the training of a 3DIS network with the pseudo-labels. +However, due to the presence of intersections among bboxes, not every point has +a determined instance label, especially in overlapping areas. To generate +higher quality pseudo-labels and achieve more precise weakly supervised 3DIS +results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D +Instance Segmentation (BSNet), which devises a novel pseudo-labeler called +Simulation-assisted Transformer. 
The labeler consists of two main components. +The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher +for the first time in this task and constructs simulated samples to assist the +labeler in acquiring prior knowledge about overlapping areas. To better model +local-global structure, we also propose Local-Global Aware Attention as the +decoder for teacher and student labelers. Extensive experiments conducted on +the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is +available at +\href{https://github.com/peoplelu/BSNet}{https://github.com/peoplelu/BSNet}.",cs.CV,['cs.CV'] +Motion Diversification Networks,Hee Jae Kim · Eshed Ohn-Bar, ,,https://www.kdramastars.com/articles/131362/20230922/moving-actor-stuns-viewers-unrecognizable-transformation-villain.htm,,,,,nan +PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks,Marina Neseem · Conor McCullough · Randy Hsin · Chas Leichner · Shan Li · In Suk Chong · Andrew Howard · Lukasz Lew · Sherief Reda · Ville-Mikko Rautio · Daniele Moro, ,https://arxiv.org/abs/2404.00103,,2404.00103.pdf,PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks,"Low-precision quantization is recognized for its efficacy in neural network +optimization. Our analysis reveals that non-quantized elementwise operations +which are prevalent in layers such as parameterized activation functions, batch +normalization, and quantization scaling dominate the inference cost of +low-precision models. These non-quantized elementwise operations are commonly +overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort +(ACE). In this paper, we propose ACEv2 - an extended version of ACE which +offers a better alignment with the inference cost of quantized models and their +energy consumption on ML hardware. Moreover, we introduce PikeLPN, a model that +addresses these efficiency issues by applying quantization to both elementwise +operations and multiply-accumulate operations. In particular, we present a +novel quantization technique for batch normalization layers named QuantNorm +which allows for quantizing the batch normalization parameters without +compromising the model performance. Additionally, we propose applying Double +Quantization where the quantization scaling parameters are quantized. +Furthermore, we recognize and resolve the issue of distribution mismatch in +Separable Convolution layers by introducing Distribution-Heterogeneous +Quantization which enables quantizing them to low-precision. PikeLPN achieves +Pareto-optimality in efficiency-accuracy trade-off with up to 3X efficiency +improvement compared to SOTA low-precision models.",cs.LG,"['cs.LG', 'cs.CV']" +Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed,Yifan Wang · Xingyi He · Sida Peng · Dongli Tan · Xiaowei Zhou, ,https://arxiv.org/abs/2403.04765,,2403.04765.pdf,Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed,"We present a novel method for efficiently producing semi-dense matches across +images. Previous detector-free matcher LoFTR has shown remarkable matching +capability in handling large-viewpoint change and texture-poor scenarios but +suffers from low efficiency. We revisit its design choices and derive multiple +improvements for both efficiency and accuracy. 
One key observation is that +performing the transformer over the entire feature map is redundant due to +shared local information, therefore we propose an aggregated attention +mechanism with adaptive token selection for efficiency. Furthermore, we find +spatial variance exists in LoFTR's fine correlation module, which is adverse to +matching accuracy. A novel two-stage correlation layer is proposed to achieve +accurate subpixel correspondences for accuracy improvement. Our efficiency +optimized model is $\sim 2.5\times$ faster than LoFTR which can even surpass +state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. +Moreover, extensive experiments show that our method can achieve higher +accuracy compared with competitive semi-dense matchers, with considerable +efficiency benefits. This opens up exciting prospects for large-scale or +latency-sensitive applications such as image retrieval and 3D reconstruction. +Project page: https://zju3dv.github.io/efficientloftr.",cs.CV,['cs.CV'] +"Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly",Hang Du · Sicheng Zhang · Binzhu Xie · Guoshun Nan · Jiayang Zhang · Junrui Xu · Hangyu Liu · Sicong Leng · Jiangming Liu · Hehe Fan · Dajiu Huang · Jing Feng · Linli Chen · Can Zhang · Xuhuan Li · Hao Zhang · Jianhang Chen · Qimei Cui · Xiaofeng Tao, ,https://arxiv.org/abs/2405.00181,,2405.00181.pdf,"Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly","Video anomaly understanding (VAU) aims to automatically comprehend unusual +occurrences in videos, thereby enabling various applications such as traffic +surveillance and industrial manufacturing. While existing VAU benchmarks +primarily concentrate on anomaly detection and localization, our focus is on +more practicality, prompting us to raise the following crucial questions: ""what +anomaly occurred?"", ""why did it happen?"", and ""how severe is this abnormal +event?"". In pursuit of these answers, we present a comprehensive benchmark for +Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of +the proposed benchmark involves three sets of human annotations to indicate the +""what"", ""why"" and ""how"" of an anomaly, including 1) anomaly type, start and end +times, and event descriptions, 2) natural language explanations for the cause +of an anomaly, and 3) free text reflecting the effect of the abnormality. In +addition, we also introduce MMEval, a novel evaluation metric designed to +better align with human preferences for CUVA, facilitating the measurement of +existing LLMs in comprehending the underlying cause and corresponding effect of +video anomalies. Finally, we propose a novel prompt-based method that can serve +as a baseline approach for the challenging CUVA. We conduct extensive +experiments to show the superiority of our evaluation metric and the +prompt-based approach. Our code and dataset are available at +https://github.com/fesvhtr/CUVA.",cs.CV,"['cs.CV', 'cs.AI']" +GART: Gaussian Articulated Template Models,Jiahui Lei · Yufu Wang · Georgios Pavlakos · Lingjie Liu · Kostas Daniilidis, ,https://arxiv.org/abs/2311.16099,,2311.16099.pdf,GART: Gaussian Articulated Template Models,"We introduce Gaussian Articulated Template Model GART, an explicit, +efficient, and expressive representation for non-rigid articulated subject +capturing and rendering from monocular videos. 
GART utilizes a mixture of +moving 3D Gaussians to explicitly approximate a deformable subject's geometry +and appearance. It takes advantage of a categorical template model prior (SMPL, +SMAL, etc.) with learnable forward skinning while further generalizing to more +complex non-rigid deformations with novel latent bones. GART can be +reconstructed via differentiable rendering from monocular videos in seconds or +minutes and rendered in novel poses faster than 150fps.",cs.CV,"['cs.CV', 'cs.GR']" +Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering,Tao Lu · Mulin Yu · Linning Xu · Yuanbo Xiangli · Limin Wang · Dahua Lin · Bo Dai, ,https://arxiv.org/abs/2312.00109,,2312.00109.pdf,Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering,"Neural rendering methods have significantly advanced photo-realistic 3D scene +rendering in various academic and industrial applications. The recent 3D +Gaussian Splatting method has achieved the state-of-the-art rendering quality +and speed combining the benefits of both primitive-based representations and +volumetric representations. However, it often leads to heavily redundant +Gaussians that try to fit every training view, neglecting the underlying scene +geometry. Consequently, the resulting model becomes less robust to significant +view changes, texture-less area and lighting effects. We introduce Scaffold-GS, +which uses anchor points to distribute local 3D Gaussians, and predicts their +attributes on-the-fly based on viewing direction and distance within the view +frustum. Anchor growing and pruning strategies are developed based on the +importance of neural Gaussians to reliably improve the scene coverage. We show +that our method effectively reduces redundant Gaussians while delivering +high-quality rendering. We also demonstrates an enhanced capability to +accommodate scenes with varying levels-of-detail and view-dependent +observations, without sacrificing the rendering speed.",cs.CV,['cs.CV'] +DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF,Jie Long Lee · Chen Li · Gim Hee Lee, ,https://arxiv.org/abs/2404.00874,,2404.00874.pdf,DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF,"We present DiSR-NeRF, a diffusion-guided framework for view-consistent +super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement +for high-resolution (HR) reference images by leveraging existing powerful 2D +super-resolution models. Nonetheless, independent SR 2D images are often +inconsistent across different views. We thus propose Iterative 3D +Synchronization (I3DS) to mitigate the inconsistency problem via the inherent +multi-view consistency property of NeRF. Specifically, our I3DS alternates +between upscaling low-resolution (LR) rendered images with diffusion models, +and updating the underlying 3D representation with standard NeRF training. We +further introduce Renoised Score Distillation (RSD), a novel score-distillation +objective for 2D image resolution. Our RSD combines features from ancestral +sampling and Score Distillation Sampling (SDS) to generate sharp images that +are also LR-consistent. Qualitative and quantitative results on both synthetic +and real-world datasets demonstrate that our DiSR-NeRF can achieve better +results on NeRF super-resolution compared with existing works. 
Code and video +results available at the project website.",cs.CV,['cs.CV'] +SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction,Yuanhui Huang · Wenzhao Zheng · Borui Zhang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2311.12754,,2311.12754.pdf,SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction,"3D occupancy prediction is an important task for the robustness of +vision-centric autonomous driving, which aims to predict whether each point is +occupied in the surrounding 3D space. Existing methods usually require 3D +occupancy labels to produce meaningful results. However, it is very laborious +to annotate the occupancy status of each voxel. In this paper, we propose +SelfOcc to explore a self-supervised way to learn 3D occupancy using only video +sequences. We first transform the images into the 3D space (e.g., bird's eye +view) to obtain 3D representation of the scene. We directly impose constraints +on the 3D representations by treating them as signed distance fields. We can +then render 2D images of previous and future frames as self-supervision signals +to learn the 3D representations. We propose an MVS-embedded strategy to +directly optimize the SDF-induced weights with multiple depth proposals. Our +SelfOcc outperforms the previous best method SceneRF by 58.7% using a single +frame as input on SemanticKITTI and is the first self-supervised work that +produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc +produces high-quality depth and achieves state-of-the-art results on novel +depth synthesis, monocular depth estimation, and surround-view depth estimation +on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: +https://github.com/huang-yh/SelfOcc.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Seeing the Unseen: Visual Common Sense for Semantic Placement,Ram Ramrakhya · Aniruddha Kembhavi · Dhruv Batra · Zsolt Kira · Kuo-Hao Zeng · Luca Weihs, ,https://arxiv.org/abs/2401.07770,,2401.07770.pdf,Seeing the Unseen: Visual Common Sense for Semantic Placement,"Computer vision tasks typically involve describing what is present in an +image (e.g. classification, detection, segmentation, and captioning). We study +a visual common sense task that requires understanding what is not present. +Specifically, given an image (e.g. of a living room) and name of an object +(""cushion""), a vision system is asked to predict semantically-meaningful +regions (masks or bounding boxes) in the image where that object could be +placed or is likely be placed by humans (e.g. on the sofa). We call this task: +Semantic Placement (SP) and believe that such common-sense visual understanding +is critical for assitive robots (tidying a house), and AR devices +(automatically rendering an object in the user's space). Studying the invisible +is hard. Datasets for image description are typically constructed by curating +relevant images and asking humans to annotate the contents of the image; +neither of those two steps are straightforward for objects not present in the +image. We overcome this challenge by operating in the opposite direction: we +start with an image of an object in context from web, and then remove that +object from the image via inpainting. This automated pipeline converts +unstructured web data into a dataset comprising pairs of images with/without +the object. Using this, we collect a novel dataset, with ${\sim}1.3$M images +across $9$ object categories, and train a SP prediction model called CLIP-UNet. 
+CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors +with object detectors on real-world and simulated images. In our user studies, +we find that the SP masks predicted by CLIP-UNet are favored $43.7\%$ and +$31.3\%$ times when comparing against the $4$ SP baselines on real and +simulated images. In addition, we demonstrate leveraging SP mask predictions +from CLIP-UNet enables downstream applications like building tidying robots in +indoor environments.",cs.CV,['cs.CV'] +Non-autoregressive Sequence-to-Sequence Vision-Language Models,Kunyu Shi · Qi Dong · Luis Goncalves · Zhuowen Tu · Stefano Soatto, ,https://arxiv.org/abs/2403.02249,,2403.02249.pdf,Non-autoregressive Sequence-to-Sequence Vision-Language Models,"Sequence-to-sequence vision-language models are showing promise, but their +applicability is limited by their inference latency due to their autoregressive +way of generating predictions. We propose a parallel decoding +sequence-to-sequence vision-language model, trained with a Query-CTC loss, that +marginalizes over multiple inference paths in the decoder. This allows us to +model the joint distribution of tokens, rather than restricting to conditional +distribution as in an autoregressive model. The resulting model, NARVL, +achieves performance on-par with its state-of-the-art autoregressive +counterpart, but is faster at inference time, reducing from the linear +complexity associated with the sequential generation of tokens to a paradigm of +constant time joint inference.",cs.CV,"['cs.CV', 'cs.AI']" +Deep Video Inverse Tone Mapping Based on Temporal Clues,Yuyao Ye · Ning Zhang · Yang Zhao · Hongbin Cao · Ronggang Wang, ,,https://dl.acm.org/doi/10.1145/3648570,,,,,nan +L2B: Learning to Bootstrap Robust Models for Combating Label Noise,Yuyin Zhou · Xianhang li · Fengze Liu · Qingyue Wei · Xuxi Chen · Lequan Yu · Cihang Xie · Matthew P. Lungren · Lei Xing, ,,https://link.springer.com/chapter/10.1007/978-3-031-43415-0_1,,,,,nan +Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation,Sangyun Shin · Kaichen Zhou · Madhu Vankadari · Andrew Markham · Niki Trigoni, ,https://arxiv.org/abs/2312.11269,,2312.11269.pdf,Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation,"Coarse-to-fine 3D instance segmentation methods show weak performances +compared to recent Grouping-based, Kernel-based and Transformer-based methods. +We argue that this is due to two limitations: 1) Instance size overestimation +by axis-aligned bounding box(AABB) 2) False negative error accumulation from +inaccurate box to the refinement phase. In this work, we introduce Spherical +Mask, a novel coarse-to-fine approach based on spherical representation, +overcoming those two limitations with several benefits. Specifically, our +coarse detection estimates each instance with a 3D polygon using a center and +radial distance predictions, which avoids excessive size estimation of AABB. To +cut the error propagation in the existing coarse-to-fine approaches, we +virtually migrate points based on the polygon, allowing all foreground points, +including false negatives, to be refined. During inference, the proposal and +point migration modules run in parallel and are assembled to form binary masks +of instances. We also introduce two margin-based losses for the point migration +to enforce corrections for the false positives/negatives and cohesion of +foreground points, significantly improving the performance. 
Experimental +results from three datasets, such as ScanNetV2, S3DIS, and STPLS3D, show that +our proposed method outperforms existing works, demonstrating the effectiveness +of the new instance representation with spherical coordinates.",cs.CV,"['cs.CV', 'cs.LG']" +DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,Zhenghao Pan · Haijin Zeng · Jiezhang Cao · Kai Zhang · Yongyong Chen,https://github.com/PAN083/DiffSCI,https://arxiv.org/abs/2311.11417,,2311.11417.pdf,DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,"This paper endeavors to advance the precision of snapshot compressive imaging +(SCI) reconstruction for multispectral image (MSI). To achieve this, we +integrate the advantageous attributes of established SCI techniques and an +image generative model, propose a novel structured zero-shot diffusion model, +dubbed DiffSCI. DiffSCI leverages the structural insights from the deep prior +and optimization-based methodologies, complemented by the generative +capabilities offered by the contemporary denoising diffusion model. +Specifically, firstly, we employ a pre-trained diffusion model, which has been +trained on a substantial corpus of RGB images, as the generative denoiser +within the Plug-and-Play framework for the first time. This integration allows +for the successful completion of SCI reconstruction, especially in the case +that current methods struggle to address effectively. Secondly, we +systematically account for spectral band correlations and introduce a robust +methodology to mitigate wavelength mismatch, thus enabling seamless adaptation +of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is +implemented to expedite the resolution of the data subproblem. This +augmentation not only accelerates the convergence rate but also elevates the +quality of the reconstruction process. We present extensive testing to show +that DiffSCI exhibits discernible performance enhancements over prevailing +self-supervised and zero-shot approaches, surpassing even supervised +transformer counterparts across both simulated and real datasets. Our code will +be available.",cs.CV,['cs.CV'] +$\mathsf{LQMFormer}$:~Language-aware Query Mask Transformer for Referring Image Segmentation,Nisarg Shah · Vibashan VS · Vishal M. Patel, ,https://arxiv.org/abs/2312.12198,,,Mask Grounding for Referring Image Segmentation,"Referring Image Segmentation (RIS) is a challenging task that requires an +algorithm to segment objects referred by free-form language expressions. +Despite significant progress in recent years, most state-of-the-art (SOTA) +methods still suffer from considerable language-image modality gap at the pixel +and word level. These methods generally 1) rely on sentence-level language +features for language-image alignment and 2) lack explicit training supervision +for fine-grained visual grounding. Consequently, they exhibit weak object-level +correspondence between visual and language features. Without well-grounded +features, prior methods struggle to understand complex expressions that require +strong reasoning over relationships among multiple objects, especially when +dealing with rarely used or ambiguous clauses. To tackle this challenge, we +introduce a novel Mask Grounding auxiliary task that significantly improves +visual grounding within language features, by explicitly teaching the model to +learn fine-grained correspondence between masked textual tokens and their +matching visual objects. 
Mask Grounding can be directly used on prior RIS +methods and consistently bring improvements. Furthermore, to holistically +address the modality gap, we also design a cross-modal alignment loss and an +accompanying alignment module. These additions work synergistically with Mask +Grounding. With all these techniques, our comprehensive approach culminates in +MagNet (Mask-grounded Network), an architecture that significantly outperforms +prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating +our method's effectiveness in addressing current limitations of RIS algorithms. +Our code and pre-trained weights will be released.",cs.CV,['cs.CV'] +CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor,Shuyang Sun · Runjia Li · Philip H.S. Torr · Xiuye Gu · Siyang Li, ,https://arxiv.org/abs/2312.07661,,2312.07661.pdf,CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor,"Existing open-vocabulary image segmentation methods require a fine-tuning +step on mask labels and/or image-text datasets. Mask labels are +labor-intensive, which limits the number of categories in segmentation +datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely +reduced after fine-tuning. However, without fine-tuning, VLMs trained under +weak image-text supervision tend to make suboptimal mask predictions. To +alleviate these issues, we introduce a novel recurrent framework that +progressively filters out irrelevant texts and enhances mask quality without +training efforts. The recurrent unit is a two-stage segmenter built upon a +frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips +it with segmentation ability. Experiments show that our method outperforms not +only the training-free counterparts, but also those fine-tuned with millions of +data samples, and sets the new state-of-the-art records for both zero-shot +semantic and referring segmentation. Concretely, we improve the current record +by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG', 'cs.MM']" +Improving Generalization via Meta-Learning on Hard Samples,Nishant Jain · Arun Suggala · Pradeep Shenoy, ,https://arxiv.org/abs/2403.12236,,2403.12236.pdf,Improving Generalization via Meta-Learning on Hard Samples,"Learned reweighting (LRW) approaches to supervised learning use an +optimization criterion to assign weights for training instances, in order to +maximize performance on a representative validation dataset. We pose and +formalize the problem of optimized selection of the validation set used in LRW +training, to improve classifier generalization. In particular, we show that +using hard-to-classify instances in the validation set has both a theoretical +connection to, and strong empirical evidence of generalization. We provide an +efficient algorithm for training this meta-optimized model, as well as a simple +train-twice heuristic for careful comparative study. We demonstrate that LRW +with easy validation data performs consistently worse than LRW with hard +validation data, establishing the validity of our meta-optimization problem. +Our proposed algorithm outperforms a wide range of baselines on a range of +datasets and domain shift challenges (Imagenet-1K, CIFAR-100, Clothing-1M, +CAMELYON, WILDS, etc.), with ~1% gains using VIT-B on Imagenet. 
We also show +that using naturally hard examples for validation (Imagenet-R / Imagenet-A) in +LRW training for Imagenet improves performance on both clean and naturally hard +test instances by 1-2%. Secondary analyses show that using hard validation data +in an LRW framework improves margins on test data, hinting at the mechanism +underlying our empirical gains. We believe this work opens up new research +directions for the meta-optimization of meta-learning in a supervised learning +context.",cs.LG,"['cs.LG', 'cs.CV']" +PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns,Shuliang Ning · Duomin Wang · Yipeng Qin · Zirong Jin · Baoyuan Wang · Xiaoguang Han, ,https://arxiv.org/abs/2312.04534,,2312.04534.pdf,PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns,"In this paper, we propose a novel virtual try-on from unconstrained designs +(ucVTON) task to enable photorealistic synthesis of personalized composite +clothing on input human images. Unlike prior arts constrained by specific input +types, our method allows flexible specification of style (text or image) and +texture (full garment, cropped sections, or texture patches) conditions. To +address the entanglement challenge when using full garment images as +conditions, we develop a two-stage pipeline with explicit disentanglement of +style and texture. In the first stage, we generate a human parsing map +reflecting the desired style conditioned on the input. In the second stage, we +composite textures onto the parsing map areas based on the texture input. To +represent complex and non-stationary textures that have never been achieved in +previous fashion editing works, we first propose extracting hierarchical and +balanced CLIP features and applying position encoding in VTON. Experiments +demonstrate superior synthesis quality and personalization enabled by our +method. The flexible control over style and texture mixing brings virtual +try-on to a new level of user experience for online shopping and fashion +design.",cs.CV,['cs.CV'] +KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,Hugues Thomas · Yao-Hung Hubert Tsai · Timothy Barfoot · Jian Zhang, ,https://arxiv.org/abs/2405.13194,,2405.13194.pdf,KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,"In the field of deep point cloud understanding, KPConv is a unique +architecture that uses kernel points to locate convolutional weights in space, +instead of relying on Multi-Layer Perceptron (MLP) encodings. While it +initially achieved success, it has since been surpassed by recent MLP networks +that employ updated designs and training strategies. Building upon the kernel +point principle, we present two novel designs: KPConvD (depthwise KPConv), a +lighter design that enables the use of deeper architectures, and KPConvX, an +innovative design that scales the depthwise convolutional weights of KPConvD +with kernel attention values. Using KPConvX with a modern architecture and +training strategy, we are able to outperform current state-of-the-art +approaches on the ScanObjectNN, Scannetv2, and S3DIS datasets. 
We validate our +design choices through ablation studies and release our code and models.",cs.CV,['cs.CV'] +FedAS: Bridging Inconsistency in Personalized Federated Learning,Xiyuan Yang · Wenke Huang · Mang Ye,https://github.com/xiyuanyang45/FedAS,,https://dl.acm.org/doi/10.5555/3666122.3669282,,,,,nan +DeIl: Direct and Inverse CLIP for Open-World Few-Shot Learning,Shuai Shao · Yu Bai · Yan WANG · Bao-di Liu · Yicong Zhou, ,,https://www.semanticscholar.org/paper/Collaborative-Consortium-of-Foundation-Models-for-Shao-Bai/90668de8b1c5dcb0471444e3177dc28e20fce5d4,,,,,nan +"Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding",Wujian Peng · Sicheng Xie · Zuyao You · Shiyi Lan · Zuxuan Wu,https://github.com/wjpoom/SPEC,https://arxiv.org/abs/2312.00081,,2312.00081.pdf,"Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding","Vision language models (VLM) have demonstrated remarkable performance across +various downstream tasks. However, understanding fine-grained visual-linguistic +concepts, such as attributes and inter-object relationships, remains a +significant challenge. While several benchmarks aim to evaluate VLMs in finer +granularity, their primary focus remains on the linguistic aspect, neglecting +the visual dimension. Here, we highlight the importance of evaluating VLMs from +both a textual and visual perspective. We introduce a progressive pipeline to +synthesize images that vary in a specific attribute while ensuring consistency +in all other aspects. Utilizing this data engine, we carefully design a +benchmark, SPEC, to diagnose the comprehension of object size, position, +existence, and count. Subsequently, we conduct a thorough evaluation of four +leading VLMs on SPEC. Surprisingly, their performance is close to random guess, +revealing significant limitations. With this in mind, we propose a simple yet +effective approach to optimize VLMs in fine-grained understanding, achieving +significant improvements on SPEC without compromising the zero-shot +performance. Results on two additional fine-grained benchmarks also show +consistent improvements, further validating the transferability of our +approach. Code and data are available at https://github.com/wjpoom/SPEC.",cs.CV,['cs.CV'] +CPR: Retrieval Augmented Generation for Copyright Protection,Aditya Golatkar · Alessandro Achille · Luca Zancato · Yu-Xiang Wang · Ashwin Swaminathan · Stefano Soatto · Stefano Soatto, ,https://arxiv.org/abs/2403.18920,,2403.18920.pdf,CPR: Retrieval Augmented Generation for Copyright Protection,"Retrieval Augmented Generation (RAG) is emerging as a flexible and robust +technique to adapt models to private users data without training, to handle +credit attribution, and to allow efficient machine unlearning at scale. +However, RAG techniques for image generation may lead to parts of the retrieved +samples being copied in the model's output. To reduce risks of leaking private +information contained in the retrieved set, we introduce Copy-Protected +generation with Retrieval (CPR), a new method for RAG with strong copyright +protection guarantees in a mixed-private setting for diffusion models.CPR +allows to condition the output of diffusion models on a set of retrieved +images, while also guaranteeing that unique identifiable information about +those example is not exposed in the generated outputs. 
In particular, it does +so by sampling from a mixture of public (safe) distribution and private (user) +distribution by merging their diffusion scores at inference. We prove that CPR +satisfies Near Access Freeness (NAF) which bounds the amount of information an +attacker may be able to extract from the generated images. We provide two +algorithms for copyright protection, CPR-KL and CPR-Choose. Unlike previously +proposed rejection-sampling-based NAF methods, our methods enable efficient +copyright-protected sampling with a single run of backward diffusion. We show +that our method can be applied to any pre-trained conditional diffusion model, +such as Stable Diffusion or unCLIP. In particular, we empirically show that +applying CPR on top of unCLIP improves quality and text-to-image alignment of +the generated results (81.4 to 83.17 on TIFA benchmark), while enabling credit +attribution, copy-right protection, and deterministic, constant time, +unlearning.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV']" +FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models,Shivangi Aneja · Justus Thies · Angela Dai · Matthias Nießner, ,https://arxiv.org/abs/2312.08459,,2312.08459.pdf,FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models,"We introduce FaceTalk, a novel generative approach designed for synthesizing +high-fidelity 3D motion sequences of talking human heads from input audio +signal. To capture the expressive, detailed nature of human heads, including +hair, ears, and finer-scale eye movements, we propose to couple speech signal +with the latent space of neural parametric head models to create high-fidelity, +temporally coherent motion sequences. We propose a new latent diffusion model +for this task, operating in the expression space of neural parametric head +models, to synthesize audio-driven realistic head sequences. In the absence of +a dataset with corresponding NPHM expressions to audio, we optimize for these +correspondences to produce a dataset of temporally-optimized NPHM expressions +fit to audio-video recordings of people talking. To the best of our knowledge, +this is the first work to propose a generative approach for realistic and +high-quality motion synthesis of volumetric human heads, representing a +significant advancement in the field of audio-driven 3D animation. Notably, our +approach stands out in its ability to generate plausible motion sequences that +can produce high-fidelity head animation coupled with the NPHM shape space. Our +experimental results substantiate the effectiveness of FaceTalk, consistently +achieving superior and visually natural motion, encompassing diverse facial +expressions and styles, outperforming existing methods by 75% in perceptual +user study evaluation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.SD', 'eess.AS']" +Binding Touch to Everything: Learning Unified Multimodal Tactile Representations,Fengyu Yang · Chao Feng · Ziyang Chen · Hyoungseob Park · Daniel Wang · Yiming Dou · Ziyao Zeng · xien chen · Suchisrit Gangopadhyay · Andrew Owens · Alex Wong, ,https://arxiv.org/abs/2401.18084,,2401.18084.pdf,Binding Touch to Everything: Learning Unified Multimodal Tactile Representations,"The ability to associate touch with other modalities has huge implications +for humans and computational systems. However, multimodal learning with touch +remains challenging due to the expensive data collection process and +non-standardized sensor outputs. 
We introduce UniTouch, a unified tactile model +for vision-based touch sensors connected to multiple modalities, including +vision, language, and sound. We achieve this by aligning our UniTouch +embeddings to pretrained image embeddings already associated with a variety of +other modalities. We further propose learnable sensor-specific tokens, allowing +the model to learn from a set of heterogeneous tactile sensors, all at the same +time. UniTouch is capable of conducting various touch sensing tasks in the +zero-shot setting, from robot grasping prediction to touch image question +answering. To the best of our knowledge, UniTouch is the first to demonstrate +such capabilities. Project page: https://cfeng16.github.io/UniTouch/",cs.CV,"['cs.CV', 'cs.RO']" +Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles,Vanessa Sklyarova · Egor Zakharov · Otmar Hilliges · Michael J. Black · Justus Thies,https://haar.is.tue.mpg.de/,https://arxiv.org/abs/2312.11666,,2312.11666.pdf,HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles,"We present HAAR, a new strand-based generative model for 3D human hairstyles. +Specifically, based on textual inputs, HAAR produces 3D hairstyles that could +be used as production-level assets in modern computer graphics engines. Current +AI-based generative models take advantage of powerful 2D priors to reconstruct +3D content in the form of point clouds, meshes, or volumetric functions. +However, by using the 2D priors, they are intrinsically limited to only +recovering the visual parts. Highly occluded hair structures can not be +reconstructed with those methods, and they only model the ''outer shell'', +which is not ready to be used in physics-based rendering or simulation +pipelines. In contrast, we propose a first text-guided generative method that +uses 3D hair strands as an underlying representation. Leveraging 2D visual +question-answering (VQA) systems, we automatically annotate synthetic hair +models that are generated from a small set of artist-created hairstyles. This +allows us to train a latent diffusion model that operates in a common hairstyle +UV space. In qualitative and quantitative studies, we demonstrate the +capabilities of the proposed model and compare it to existing hairstyle +generation approaches.",cs.CV,"['cs.CV', 'cs.GR']" +Sieve: Multimodal Dataset Pruning using Image-Captioning Models,Anas Mahmoud · Mostafa Elhoushi · Amro Abbas · Yu Yang · Newsha Ardalani · Hugh Leather · Ari Morcos, ,https://arxiv.org/abs/2310.02110,,2310.02110.pdf,Sieve: Multimodal Dataset Pruning Using Image Captioning Models,"Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy +web-crawled datasets. This underscores the critical need for dataset pruning, +as the quality of these datasets is strongly correlated with the performance of +VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train +models using highly-aligned samples is one of the most successful methods for +pruning. We argue that this approach suffers from multiple limitations +including: false positives and negatives due to CLIP's pretraining on noisy +labels. We propose a pruning signal, Sieve, that employs synthetic captions +generated by image-captioning models pretrained on small, diverse, and +well-aligned image-text pairs to evaluate the alignment of noisy image-text +pairs. 
To bridge the gap between the limited diversity of generated captions +and the high diversity of alternative text (alt-text), we estimate the semantic +textual similarity in the embedding space of a language model pretrained on +unlabeled text corpus. Using DataComp, a multimodal dataset filtering +benchmark, when evaluating on 38 downstream tasks, our pruning approach, +surpasses CLIPScore by 2.6\% and 1.7\% on medium and large scale respectively. +In addition, on retrieval tasks, Sieve leads to a significant improvement of +2.7% and 4.5% on medium and large scale respectively.",cs.CV,['cs.CV'] +Streaming Dense Video Captioning,Xingyi Zhou · Anurag Arnab · Shyamal Buch · Shen Yan · Austin Myers · Xuehan Xiong · Arsha Nagrani · Cordelia Schmid, ,https://arxiv.org/abs/2404.01297,,2404.01297.pdf,Streaming Dense Video Captioning,"An ideal model for dense video captioning -- predicting captions localized +temporally in a video -- should be able to handle long input videos, predict +rich, detailed textual descriptions, and be able to produce outputs before +processing the entire video. Current state-of-the-art models, however, process +a fixed number of downsampled frames, and make a single full prediction after +seeing the whole video. We propose a streaming dense video captioning model +that consists of two novel components: First, we propose a new memory module, +based on clustering incoming tokens, which can handle arbitrarily long videos +as the memory is of a fixed size. Second, we develop a streaming decoding +algorithm that enables our model to make predictions before the entire video +has been processed. Our model achieves this streaming ability, and +significantly improves the state-of-the-art on three dense video captioning +benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at +https://github.com/google-research/scenic.",cs.CV,['cs.CV'] +DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,Hao Yan · Zhihui Ke · Xiaobo Zhou · Tie Qiu · Xidong Shi · DaDong Jiang,https://haoyan14.github.io/DS-NeRV/,https://arxiv.org/abs/2403.15679,,,DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,"Implicit neural representations for video (NeRV) have recently become a novel +way for high-quality video representation. However, existing works employ a +single network to represent the entire video, which implicitly confuse static +and dynamic information. This leads to an inability to effectively compress the +redundant static information and lack the explicitly modeling of global +temporal-coherent dynamic details. To solve above problems, we propose DS-NeRV, +which decomposes videos into sparse learnable static codes and dynamic codes +without the need for explicit optical flow or residual supervision. By setting +different sampling rates for two codes and applying weighted sum and +interpolation sampling methods, DS-NeRV efficiently utilizes redundant static +information while maintaining high-frequency details. Additionally, we design a +cross-channel attention-based (CCA) fusion module to efficiently fuse these two +codes for frame decoding. Our approach achieves a high quality reconstruction +of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic +codes representation and outperforms existing NeRV methods in many downstream +tasks. 
Our project website is at https://haoyan14.github.io/DS-NeRV.",cs.CV,"['cs.CV', 'cs.MM']" +SANeRF-HQ: Segment Anything for NeRF in High Quality,Yichen Liu · Benran Hu · Chi-Keung Tang · Yu-Wing Tai, ,https://arxiv.org/abs/2312.01531,,2312.01531.pdf,SANeRF-HQ: Segment Anything for NeRF in High Quality,"Recently, the Segment Anything Model (SAM) has showcased remarkable +capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has +gained popularity as a method for various 3D problems beyond novel view +synthesis. Though there exist initial attempts to incorporate these two methods +into 3D segmentation, they face the challenge of accurately and consistently +segmenting objects in complex scenarios. In this paper, we introduce the +Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality +3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM +for open-world object segmentation guided by user-supplied prompts, while +leveraging NeRF to aggregate information from different viewpoints. To overcome +the aforementioned challenges, we employ density field and RGB similarity to +enhance the accuracy of segmentation boundary during the aggregation. +Emphasizing on segmentation accuracy, we evaluate our method on multiple NeRF +datasets where high-quality ground-truths are available or manually annotated. +SANeRF-HQ shows a significant quality improvement over state-of-the-art methods +in NeRF object segmentation, provides higher flexibility for object +localization, and enables more consistent object segmentation across multiple +views. Results and code are available at the project site: +https://lyclyc52.github.io/SANeRF-HQ/.",cs.CV,['cs.CV'] +\emph{RealCustom}: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization,Mengqi Huang · Zhendong Mao · Mingcong Liu · Qian HE · Yongdong Zhang,https://corleone-huang.github.io/realcustom/,https://arxiv.org/abs/2403.00483,,2403.00483.pdf,RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization,"Text-to-image customization, which aims to synthesize text-driven images for +the given subjects, has recently revolutionized content creation. Existing +works follow the pseudo-word paradigm, i.e., represent the given subjects as +pseudo-words and then compose them with the given text. However, the inherent +entangled influence scope of pseudo-words with the given text results in a +dual-optimum paradox, i.e., the similarity of the given subjects and the +controllability of the given text could not be optimal simultaneously. We +present RealCustom that, for the first time, disentangles similarity from +controllability by precisely limiting subject influence to relevant parts only, +achieved by gradually narrowing real text word from its general connotation to +the specific subject and using its cross-attention to distinguish relevance. +Specifically, RealCustom introduces a novel ""train-inference"" decoupled +framework: (1) during training, RealCustom learns general alignment between +visual conditions to original textual conditions by a novel adaptive scoring +module to adaptively modulate influence quantity; (2) during inference, a novel +adaptive mask guidance strategy is proposed to iteratively update the influence +scope and influence quantity of the given subjects to gradually narrow the +generation of the real text word. 
Comprehensive experiments demonstrate the +superior real-time customization ability of RealCustom in the open domain, +achieving both unprecedented similarity of the given subjects and +controllability of the given text for the first time. The project page is +https://corleone-huang.github.io/realcustom/.",cs.CV,['cs.CV'] +Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification,Sravanti Addepalli · Ashish Asokan · Lakshay Sharma · R. Venkatesh Babu, ,https://arxiv.org/abs/2310.08255,,2310.08255.pdf,Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification,"Vision-Language Models (VLMs) such as CLIP are trained on large amounts of +image-text pairs, resulting in remarkable generalization across several data +distributions. However, in several cases, their expensive training and data +collection/curation costs do not justify the end application. This motivates a +vendor-client paradigm, where a vendor trains a large-scale VLM and grants only +input-output access to clients on a pay-per-query basis in a black-box setting. +The client aims to minimize inference cost by distilling the VLM to a student +model using the limited available task-specific data, and further deploying +this student model in the downstream application. While naive distillation +largely improves the In-Domain (ID) accuracy of the student, it fails to +transfer the superior out-of-distribution (OOD) generalization of the VLM +teacher using the limited available labeled images. To mitigate this, we +propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which +first aligns the vision and language modalities of the teacher model with the +vision modality of a pre-trained student model, and further distills the +aligned VLM representations to the student. This maximally retains the +pre-trained features of the student, while also incorporating the rich +representations of the VLM image encoder and the superior generalization of the +text embeddings. The proposed approach achieves state-of-the-art results on the +standard Domain Generalization benchmarks in a black-box teacher setting as +well as a white-box setting where the weights of the VLM are accessible.",cs.CV,['cs.CV'] +TransLoc4D: Transformer-based 4D Radar Place Recognition,Guohao Peng · Heshan Li · Yangyang Zhao · Jun Zhang · Zhenyu Wu · Pengyu Zheng · Danwei Wang, ,https://arxiv.org/abs/2401.13082,,2401.13082.pdf,PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion,"Visual place recognition is a challenging task in the field of computer +vision, and autonomous robotics and vehicles, which aims to identify a location +or a place from visual inputs. Contemporary methods in visual place recognition +employ convolutional neural networks and utilize every region within the image +for the place recognition task. However, the presence of dynamic and +distracting elements in the image may impact the effectiveness of the place +recognition process. Therefore, it is meaningful to focus on task-relevant +regions of the image for improved recognition. In this paper, we present +PlaceFormer, a novel transformer-based approach for visual place recognition. +PlaceFormer employs patch tokens from the transformer to create global image +descriptors, which are then used for image retrieval. To re-rank the retrieved +images, PlaceFormer merges the patch tokens from the transformer to form +multi-scale patches. 
Utilizing the transformer's self-attention mechanism, it +selects patches that correspond to task-relevant areas in an image. These +selected patches undergo geometric verification, generating similarity scores +across different patch sizes. Subsequently, spatial scores from each patch size +are fused to produce a final similarity score. This score is then used to +re-rank the images initially retrieved using global image descriptors. +Extensive experiments on benchmark datasets demonstrate that PlaceFormer +outperforms several state-of-the-art methods in terms of accuracy and +computational efficiency, requiring less time and memory.",cs.CV,"['cs.CV', 'cs.RO']" +Domain Gap Embeddings for Generative Dataset Augmentation,Yinong Wang · Younjoon Chung · Chen Henry Wu · Fernando De la Torre, ,https://arxiv.org/abs/2312.05387,,2312.05387.pdf,Cross Domain Generative Augmentation: Domain Generalization with Latent Diffusion Models,"Despite the huge effort in developing novel regularizers for Domain +Generalization (DG), adding simple data augmentation to the vanilla ERM which +is a practical implementation of the Vicinal Risk Minimization principle (VRM) +\citep{chapelle2000vicinal} outperforms or stays competitive with many of the +proposed regularizers. The VRM reduces the estimation error in ERM by replacing +the point-wise kernel estimates with a more precise estimation of true data +distribution that reduces the gap between data points \textbf{within each +domain}. However, in the DG setting, the estimation error of true data +distribution by ERM is mainly caused by the distribution shift \textbf{between +domains} which cannot be fully addressed by simple data augmentation techniques +within each domain. Inspired by this limitation of VRM, we propose a novel data +augmentation named Cross Domain Generative Augmentation (CDGA) that replaces +the pointwise kernel estimates in ERM with new density estimates in the +\textbf{vicinity of domain pairs} so that the gap between domains is further +reduced. To this end, CDGA, which is built upon latent diffusion models (LDM), +generates synthetic images to fill the gap between all domains and as a result, +reduces the non-iidness. We show that CDGA outperforms SOTA DG methods under +the Domainbed benchmark. To explain the effectiveness of CDGA, we generate more +than 5 Million synthetic images and perform extensive ablation studies +including data scaling laws, distribution visualization, domain shift +quantification, adversarial robustness, and loss landscape analysis.",cs.LG,['cs.LG'] +Detours for Navigating Instructional Videos,Kumar Ashutosh · Zihui Xue · Tushar Nagarajan · Kristen Grauman, ,https://arxiv.org/abs/2401.01823,,2401.01823.pdf,Detours for Navigating Instructional Videos,"We introduce the video detours problem for navigating instructional videos. +Given a source video and a natural language query asking to alter the how-to +video's current path of execution in a certain way, the goal is to find a +related ''detour video'' that satisfies the requested alteration. To address +this challenge, we propose VidDetours, a novel video-language approach that +learns to retrieve the targeted temporal segments from a large repository of +how-to's using video-and-text conditioned queries. Furthermore, we devise a +language-based pipeline that exploits how-to video narration text to create +weakly supervised training data. 
We demonstrate our idea applied to the domain +of how-to cooking videos, where a user can detour from their current recipe to +find steps with alternate ingredients, tools, and techniques. Validating on a +ground truth annotated dataset of 16K samples, we show our model's significant +improvements over best available methods for video retrieval and question +answering, with recall rates exceeding the state of the art by 35%.",cs.CV,['cs.CV'] +Iterated Learning Improves Compositionality in Large Vision-Language Models,Chenhao Zheng · Jieyu Zhang · Aniruddha Kembhavi · Ranjay Krishna, ,https://arxiv.org/abs/2404.02145,,2404.02145.pdf,Iterated Learning Improves Compositionality in Large Vision-Language Models,"A fundamental characteristic common to both human vision and natural language +is their compositional nature. Yet, despite the performance gains contributed +by large vision and language pretraining, recent investigations find that +most-if not all-our state-of-the-art vision-language models struggle at +compositionality. They are unable to distinguish between images of "" a girl in +white facing a man in black"" and ""a girl in black facing a man in white"". +Moreover, prior work suggests that compositionality doesn't arise with scale: +larger model sizes or training data don't help. This paper develops a new +iterated training algorithm that incentivizes compositionality. We draw on +decades of cognitive science research that identifies cultural transmission-the +need to teach a new generation-as a necessary inductive prior that incentivizes +humans to develop compositional languages. Specifically, we reframe +vision-language contrastive learning as the Lewis Signaling Game between a +vision agent and a language agent, and operationalize cultural transmission by +iteratively resetting one of the agent's weights during training. After every +iteration, this training paradigm induces representations that become ""easier +to learn"", a property of compositional languages: e.g. our model trained on +CC3M and CC12M improves standard CLIP by 4.7%, 4.0% respectively in the +SugarCrepe benchmark.",cs.CV,['cs.CV'] +Contrastive Mean-Shift Learning for Generalized Category Discovery,Sua Choi · Dahyun Kang · Minsu Cho, ,https://arxiv.org/abs/2404.09451,,2404.09451.pdf,Contrastive Mean-Shift Learning for Generalized Category Discovery,"We address the problem of generalized category discovery (GCD) that aims to +partition a partially labeled collection of images; only a small part of the +collection is labeled and the total number of target classes is unknown. To +address this generalized image clustering problem, we revisit the mean-shift +algorithm, i.e., a classic, powerful technique for mode seeking, and +incorporate it into a contrastive learning framework. The proposed method, +dubbed Contrastive Mean-Shift (CMS) learning, trains an image encoder to +produce representations with better clustering properties by an iterative +process of mean shift and contrastive update.
Experiments demonstrate that our +method, both in settings with and without the total number of clusters being +known, achieves state-of-the-art performance on six public GCD benchmarks +without bells and whistles.",cs.CV,['cs.CV'] +Volumetric Environment Representation for Vision-Language Navigation,Liu · Wenguan Wang · Yi Yang, ,https://arxiv.org/abs/2403.14158v1,,2403.14158v1.pdf,Volumetric Environment Representation for Vision-Language Navigation,"Vision-language navigation (VLN) requires an agent to navigate through an 3D +environment based on visual observations and natural language instructions. It +is clear that the pivotal factor for successful navigation lies in the +comprehensive scene understanding. Previous VLN agents employ monocular +frameworks to extract 2D features of perspective views directly. Though +straightforward, they struggle for capturing 3D geometry and semantics, leading +to a partial and incomplete environment representation. To achieve a +comprehensive 3D representation with fine-grained details, we introduce a +Volumetric Environment Representation (VER), which voxelizes the physical world +into structured 3D cells. For each cell, VER aggregates multi-view 2D features +into such a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature +extraction and multi-task learning for VER, our agent predicts 3D occupancy, 3D +room layout, and 3D bounding boxes jointly. Based on online collected VERs, our +agent performs volume state estimation and builds episodic memory for +predicting the next step. Experimental results show our environment +representations from multi-task learning lead to evident performance gains on +VLN. Our model achieves state-of-the-art performance across VLN benchmarks +(R2R, REVERIE, and R4R).",cs.CV,['cs.CV'] +DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data,Qihao Liu · Yi Zhang · Song Bai · Adam Kortylewski · Alan L. Yuille, ,https://arxiv.org/abs/2405.14832,,2405.14832.pdf,Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer,"Generating high-quality 3D assets from text and images has long been +challenging, primarily due to the absence of scalable 3D representations +capable of capturing intricate geometry distributions. In this work, we +introduce Direct3D, a native 3D generative model scalable to in-the-wild input +images, without requiring a multiview diffusion model or SDS optimization. Our +approach comprises two primary components: a Direct 3D Variational Auto-Encoder +(D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently +encodes high-resolution 3D shapes into a compact and continuous latent triplane +space. Notably, our method directly supervises the decoded geometry using a +semi-continuous surface sampling strategy, diverging from previous methods +relying on rendered images as supervision signals. D3D-DiT models the +distribution of encoded 3D latents and is specifically designed to fuse +positional information from the three feature maps of the triplane latent, +enabling a native 3D generative model scalable to large-scale 3D datasets. +Additionally, we introduce an innovative image-to-3D generation pipeline +incorporating semantic and pixel-level image conditions, allowing the model to +produce 3D shapes consistent with the provided conditional image input. 
+Extensive experiments demonstrate the superiority of our large-scale +pre-trained Direct3D over previous image-to-3D approaches, achieving +significantly better generation quality and generalization ability, thus +establishing a new state-of-the-art for 3D content creation. Project page: +https://nju-3dv.github.io/projects/Direct3D/.",cs.CV,['cs.CV'] +One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models,Lin Li · Haoyan Guan · Jianing Qiu · Michael Spratling,https://github.com/TreeLLi/APT,https://arxiv.org/abs/2403.01849,,2403.01849.pdf,One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models,"Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having +remarkable generalization ability, are highly vulnerable to adversarial +examples. This work studies the adversarial robustness of VLMs from the novel +perspective of the text prompt instead of the extensively studied model weights +(frozen in this work). We first show that the effectiveness of both adversarial +attack and defense are sensitive to the used text prompt. Inspired by this, we +propose a method to improve resilience to adversarial attacks by learning a +robust text prompt for VLMs. The proposed method, named Adversarial Prompt +Tuning (APT), is effective while being both computationally and data efficient. +Extensive experiments are conducted across 15 datasets and 4 data sparsity +schemes (from 1-shot to full training data settings) to show APT's superiority +over hand-engineered prompts and other state-of-the-art adaption methods. APT +demonstrated excellent abilities in terms of the in-distribution performance +and the generalization under input distribution shift and across datasets. +Surprisingly, by simply adding one learned word to the prompts, APT can +significantly boost the accuracy and robustness (epsilon=4/255) over the +hand-engineered prompts by +13% and +8.5% on average respectively. The +improvement further increases, in our most effective setting, to +26.4% for +accuracy and +16.7% for robustness. Code is available at +https://github.com/TreeLLi/APT.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI,Chong Wang · Lanqing Guo · Yufei Wang · Hao Cheng · Yi Yu · Bihan Wen,https://github.com/ChongWang1024/PDAC,https://arxiv.org/abs/2403.10064,,2403.10064.pdf,Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI,"Deep unfolding networks (DUN) have emerged as a popular iterative framework +for accelerated magnetic resonance imaging (MRI) reconstruction. However, +conventional DUN aims to reconstruct all the missing information within the +entire null space in each iteration. Thus it could be challenging when dealing +with highly ill-posed degradation, usually leading to unsatisfactory +reconstruction. In this work, we propose a Progressive Divide-And-Conquer +(PDAC) strategy, aiming to break down the subsampling process in the actual +severe degradation and thus perform reconstruction sequentially. Starting from +decomposing the original maximum-a-posteriori problem of accelerated MRI, we +present a rigorous derivation of the proposed PDAC framework, which could be +further unfolded into an end-to-end trainable network. Specifically, each +iterative stage in PDAC focuses on recovering a distinct moderate degradation +according to the decomposition. 
Furthermore, as part of the PDAC iteration, +such decomposition is adaptively learned as an auxiliary task through a +degradation predictor which provides an estimation of the decomposed sampling +mask. Following this prediction, the sampling mask is further integrated via a +severity conditioning module to ensure awareness of the degradation severity at +each stage. Extensive experiments demonstrate that our proposed method achieves +superior performance on the publicly available fastMRI and Stanford2D FSE +datasets in both multi-coil and single-coil settings.",eess.IV,"['eess.IV', 'cs.CV']" +Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World,Wen Yin · Jian Lou · Pan Zhou · Yulai Xie · Dan Feng · Yuhua Sun · Tailai Zhang · Lichao Sun, ,http://export.arxiv.org/abs/2404.19417,,2404.19417.pdf,Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World,"Backdoor attacks have been well-studied in visible light object detection +(VLOD) in recent years. However, VLOD can not effectively work in dark and +temperature-sensitive scenarios. Instead, thermal infrared object detection +(TIOD) is the most accessible and practical in such environments. In this +paper, our team is the first to investigate the security vulnerabilities +associated with TIOD in the context of backdoor attacks, spanning both the +digital and physical realms. We introduce two novel types of backdoor attacks +on TIOD, each offering unique capabilities: Object-affecting Attack and +Range-affecting Attack. We conduct a comprehensive analysis of key factors +influencing trigger design, which include temperature, size, material, and +concealment. These factors, especially temperature, significantly impact the +efficacy of backdoor attacks on TIOD. A thorough understanding of these factors +will serve as a foundation for designing physical triggers and temperature +controlling experiments. Our study includes extensive experiments conducted in +both digital and physical environments. In the digital realm, we evaluate our +approach using benchmark datasets for TIOD, achieving an Attack Success Rate +(ASR) of up to 98.21%. In the physical realm, we test our approach in two +real-world settings: a traffic intersection and a parking lot, using a thermal +infrared camera. Here, we attain an ASR of up to 98.38%.",cs.CV,['cs.CV'] +Diffusion Model Alignment Using Direct Preference Optimization,Bram Wallace · Meihua Dang · Rafael Rafailov · Linqi Zhou · Aaron Lou · Senthil Purushwalkam · Stefano Ermon · Caiming Xiong · Shafiq Joty · Nikhil Naik, ,https://arxiv.org/abs/2311.12908,,2311.12908.pdf,Diffusion Model Alignment Using Direct Preference Optimization,"Large language models (LLMs) are fine-tuned using human comparison data with +Reinforcement Learning from Human Feedback (RLHF) methods to make them better +aligned with users' preferences. In contrast to LLMs, human preference learning +has not been widely explored in text-to-image diffusion models; the best +existing approach is to fine-tune a pretrained model using carefully curated +high quality images and captions to improve visual appeal and text alignment. +We propose Diffusion-DPO, a method to align diffusion models to human +preferences by directly optimizing on human comparison data. Diffusion-DPO is +adapted from the recently developed Direct Preference Optimization (DPO), a +simpler alternative to RLHF which directly optimizes a policy that best +satisfies human preferences under a classification objective. 
We re-formulate +DPO to account for a diffusion model notion of likelihood, utilizing the +evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic +dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model +of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with +Diffusion-DPO. Our fine-tuned base model significantly outperforms both base +SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement +model in human evaluation, improving visual appeal and prompt alignment. We +also develop a variant that uses AI feedback and has comparable performance to +training on human preferences, opening the door for scaling of diffusion model +alignment methods.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +LAMP: Learn A Motion Pattern for Few-Shot Video Generation,Rui-Qi Wu · Liangyu Chen · Tong Yang · Chun-Le Guo · Chongyi Li · Xiangyu Zhang,https://rq-wu.github.io/projects/LAMP/index.html,https://arxiv.org/abs/2310.10769,,2310.10769.pdf,LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation,"With the impressive progress in diffusion-based text-to-image generation, +extending such powerful generative ability to text-to-video raises enormous +attention. Existing methods either require large-scale text-video pairs and a +large number of training resources or learn motions that are precisely aligned +with template videos. It is non-trivial to balance a trade-off between the +degree of generation freedom and the resource costs for video generation. In +our study, we present a few-shot-based tuning framework, LAMP, which enables +text-to-image diffusion model Learn A specific Motion Pattern with 8~16 videos +on a single GPU. Specifically, we design a first-frame-conditioned pipeline +that uses an off-the-shelf text-to-image model for content generation so that +our tuned video diffusion model mainly focuses on motion learning. The +well-developed text-to-image techniques can provide visually pleasing and +diverse content as generation conditions, which highly improves video quality +and generation freedom. To capture the features of temporal dimension, we +expand the pretrained 2D convolution layers of the T2I model to our novel +temporal-spatial motion learning layers and modify the attention blocks to the +temporal level. Additionally, we develop an effective inference trick, +shared-noise sampling, which can improve the stability of videos with +computational costs. Our method can also be flexibly applied to other tasks, +e.g. real-world image animation and video editing. Extensive experiments +demonstrate that LAMP can effectively learn the motion pattern on limited data +and generate high-quality videos. The code and models are available at +https://rq-wu.github.io/projects/LAMP.",cs.CV,['cs.CV'] +Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation,Xiao Lin · Wenfei Yang · Yuan Gao · Tianzhu Zhang, ,https://arxiv.org/abs/2403.19527,,2403.19527.pdf,Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation,"Category-level 6D object pose estimation aims to estimate the rotation, +translation and size of unseen instances within specific categories. In this +area, dense correspondence-based methods have achieved leading performance. 
+However, they do not explicitly consider the local and global geometric information of +different instances, resulting in poor generalization ability to +unseen instances with significant shape variations. To deal with this problem, +we propose a novel Instance-Adaptive and Geometric-Aware Keypoint Learning +method for category-level 6D object pose estimation (AG-Pose), which includes +two key designs: (1) The first design is an Instance-Adaptive Keypoint +Detection module, which can adaptively detect a set of sparse keypoints for +various instances to represent their geometric structures. (2) The second +design is a Geometric-Aware Feature Aggregation module, which can efficiently +integrate the local and global geometric information into keypoint features. +These two modules can work together to establish robust keypoint-level +correspondences for unseen instances, thus enhancing the generalization ability +of the model. Experimental results on CAMERA25 and REAL275 datasets show that +the proposed AG-Pose outperforms state-of-the-art methods by a large margin +without category-specific shape priors.",cs.CV,['cs.CV'] +RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback,Tianyu Yu · Yuan Yao · Haoye Zhang · Taiwen He · Yifeng Han · Ganqu Cui · Jinyi Hu · Zhiyuan Liu · Hai-Tao Zheng · Maosong Sun,https://github.com/RLHF-V/RLHF-V,https://arxiv.org/abs/2312.00849,,2312.00849.pdf,RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback,"Multimodal Large Language Models (MLLMs) have recently demonstrated +impressive capabilities in multimodal understanding, reasoning, and +interaction. However, existing MLLMs prevalently suffer from serious +hallucination problems, generating text that is not factually grounded in +associated images. The problem makes existing MLLMs untrustworthy and thus +impractical in real-world (especially high-stakes) applications. To address the +challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior +alignment from fine-grained correctional human feedback. Specifically, RLHF-V +collects human preference in the form of segment-level corrections on +hallucinations, and performs dense direct preference optimization over the +human feedback. Comprehensive experiments on five benchmarks in both automatic +and human evaluation show that, RLHF-V can enable substantially more +trustworthy MLLM behaviors with promising data and computation efficiency. +Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the +hallucination rate of the base MLLM by 34.8%, outperforming the concurrent +LLaVA-RLHF trained on 10k annotated data. The final model achieves +state-of-the-art performance in trustworthiness among open-source MLLMs, and +shows better robustness than GPT-4V in preventing hallucinations aroused from +over-generalization. We open-source our code, model, and data at +https://github.com/RLHF-V/RLHF-V.",cs.CL,"['cs.CL', 'cs.CV']" +"WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concept",Yong Hyun Ahn · Hyeon Bae Kim · Seong Tae Kim, ,https://arxiv.org/abs/2402.18956,,2402.18956.pdf,"WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts","Recent advancements in neural networks have showcased their remarkable +capabilities across various domains. Despite these successes, the ""black box"" +problem still remains.
Addressing this, we propose a novel framework, WWW, that +offers the 'what', 'where', and 'why' of the neural network decisions in +human-understandable terms. Specifically, WWW utilizes adaptive selection for +concept discovery, employing adaptive cosine similarity and thresholding +techniques to effectively explain 'what'. To address the 'where' and 'why', we +proposed a novel combination of neuron activation maps (NAMs) with Shapley +values, generating localized concept maps and heatmaps for individual inputs. +Furthermore, WWW introduces a method for predicting uncertainty, leveraging +heatmap similarities to estimate 'how' reliable the prediction is. Experimental +evaluations of WWW demonstrate superior performance in both quantitative and +qualitative metrics, outperforming existing methods in interpretability. WWW +provides a unified solution for explaining 'what', 'where', and 'why', +introducing a method for localized explanations from global interpretations and +offering a plug-and-play solution adaptable to various architectures.",cs.CV,['cs.CV'] +Towards Variable and Coordinated Holistic Co-Speech Motion Generation,Yifei Liu · Qiong Cao · Yandong Wen · Huaiguang Jiang · Changxing Ding, ,https://arxiv.org/abs/2404.00368,,2404.00368.pdf,Towards Variable and Coordinated Holistic Co-Speech Motion Generation,"This paper addresses the problem of generating lifelike holistic co-speech +motions for 3D avatars, focusing on two key aspects: variability and +coordination. Variability allows the avatar to exhibit a wide range of motions +even with similar speech content, while coordination ensures a harmonious +alignment among facial expressions, hand gestures, and body poses. We aim to +achieve both with ProbTalk, a unified probabilistic framework designed to +jointly model facial, hand, and body movements in speech. ProbTalk builds on +the variational autoencoder (VAE) architecture and incorporates three core +designs. First, we introduce product quantization (PQ) to the VAE, which +enriches the representation of complex holistic motion. Second, we devise a +novel non-autoregressive model that embeds 2D positional encoding into the +product-quantized representation, thereby preserving essential structure +information of the PQ codes. Last, we employ a secondary stage to refine the +preliminary prediction, further sharpening the high-frequency details. Coupling +these three designs enables ProbTalk to generate natural and diverse holistic +co-speech motions, outperforming several state-of-the-art methods in +qualitative and quantitative evaluations, particularly in terms of realism. Our +code and model will be released for research purposes at +https://feifeifeiliu.github.io/probtalk/.",cs.CV,['cs.CV'] +MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images,Junwen Huang · Hao Yu · Kuan-Ting Yu · Nassir Navab · Slobodan Ilic · Benjamin Busam, ,https://arxiv.org/abs/2403.01517,,2403.01517.pdf,MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images,"Recent learning methods for object pose estimation require resource-intensive +training for each individual object instance or category, hampering their +scalability in real applications when confronted with previously unseen +objects. In this paper, we propose MatchU, a Fuse-Describe-Match strategy for +6D pose estimation from RGB-D images. MatchU is a generic approach that fuses +2D texture and 3D geometric cues for 6D pose prediction of unseen objects. 
We +rely on learning geometric 3D descriptors that are rotation-invariant by +design. By encoding pose-agnostic geometry, the learned descriptors naturally +generalize to unseen objects and capture symmetries. To tackle ambiguous +associations using 3D geometry only, we fuse additional RGB information into +our descriptor. This is achieved through a novel attention-based mechanism that +fuses cross-modal information, together with a matching loss that leverages the +latent space learned from RGB data to guide the descriptor learning process. +Extensive experiments reveal the generalizability of both the RGB-D fusion +strategy as well as the descriptor efficacy. Benefiting from the novel designs, +MatchU surpasses all existing methods by a significant margin in terms of both +accuracy and speed, even without the requirement of expensive re-training or +rendering.",cs.CV,['cs.CV'] +SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,Chull Hwan Song · Taebaek Hwang · Jooyoung Yoon · Shunghyun Choi · Yeong Hyeon Gu, ,https://arxiv.org/abs/2404.01156,,2404.01156.pdf,SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,"Vision-language models (VLMs) have made significant strides in cross-modal +understanding through large-scale paired datasets. However, in fashion domain, +datasets often exhibit a disparity between the information conveyed in image +and text. This issue stems from datasets containing multiple images of a single +fashion item all paired with one text, leading to cases where some textual +details are not visible in individual images. This mismatch, particularly when +non-co-occurring elements are masked, undermines the training of conventional +VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby +hindering the model's ability to accurately align fine-grained visual and +textual features. Addressing this problem, we propose Synchronized attentional +Masking (SyncMask), which generate masks that pinpoint the image patches and +word tokens where the information co-occur in both image and text. This +synchronization is accomplished by harnessing cross-attentional features +obtained from a momentum model, ensuring a precise alignment between the two +modalities. Additionally, we enhance grouped batch sampling with semi-hard +negatives, effectively mitigating false negative issues in Image-Text Matching +and Image-Text Contrastive learning objectives within fashion datasets. Our +experiments demonstrate the effectiveness of the proposed approach, +outperforming existing methods in three downstream tasks.",cs.CV,"['cs.CV', 'cs.AI']" +Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception,Junwen He · Yifan Wang · Lijun Wang · Huchuan Lu · Bin Luo · Jun-Yan He · Jin-Peng Lan · Xuansong Xie, ,https://arxiv.org/abs/2403.02969,,2403.02969.pdf,Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception,"Multimodal Large Language Model (MLLMs) leverages Large Language Models as a +cognitive framework for diverse visual-language tasks. Recent efforts have been +made to equip MLLMs with visual perceiving and grounding capabilities. However, +there still remains a gap in providing fine-grained pixel-level perceptions and +extending interactions beyond text-specific inputs. 
In this work, we propose +{\bf{AnyRef}}, a general MLLM model that can generate pixel-wise object +perceptions and natural language descriptions from multi-modality references, +such as texts, boxes, images, or audio. This innovation empowers users with +greater flexibility to engage with the model beyond textual and regional +prompts, without modality-specific designs. Through our proposed refocusing +mechanism, the generated grounding output is guided to better focus on the +referenced object, implicitly incorporating additional pixel-level supervision. +This simple modification utilizes attention scores generated during the +inference of LLM, eliminating the need for extra computations while exhibiting +performance enhancements in both grounding masks and referring expressions. +With only publicly available training data, our model achieves state-of-the-art +results across multiple benchmarks, including diverse modality referring +segmentation and region-level referring expression generation.",cs.CV,['cs.CV'] +Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes,Haobin Duan · Miao Wang · Yanxun Li · Yong-Liang Yang, ,https://arxiv.org/abs/2311.15637,,2311.15637.pdf,Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes,"We present Neural 3D Strokes, a novel technique to generate stylized images +of a 3D scene at arbitrary novel views from multi-view 2D images. Different +from existing methods which apply stylization to trained neural radiance fields +at the voxel level, our approach draws inspiration from image-to-painting +methods, simulating the progressive painting process of human artwork with +vector strokes. We develop a palette of stylized 3D strokes from basic +primitives and splines, and consider the 3D scene stylization task as a +multi-view reconstruction process based on these 3D stroke primitives. Instead +of directly searching for the parameters of these 3D strokes, which would be +too costly, we introduce a differentiable renderer that allows optimizing +stroke parameters using gradient descent, and propose a training scheme to +alleviate the vanishing gradient issue. The extensive evaluation demonstrates +that our approach effectively synthesizes 3D scenes with significant geometric +and aesthetic stylization while maintaining a consistent appearance across +different views. Our method can be further integrated with style loss and +image-text contrastive models to extend its applications, including color +transfer and text-driven 3D scene drawing. Results and code are available at +http://buaavrcg.github.io/Neural3DStrokes.",cs.CV,"['cs.CV', 'cs.GR']" +A Theory of Joint Light and Heat Transport for Lambertian Scenes,Mani Ramanagopal · Sriram Narayanan · Aswin C. Sankaranarayanan · Srinivasa G. Narasimhan, ,,https://dl.acm.org/doi/10.1145/3596711.3596745,,,,,nan +Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences,Seungwook Kim · Kejie Li · Xueqing Deng · Yichun Shi · Minsu Cho · Peng Wang, ,https://arxiv.org/abs/2404.10603,,2404.10603.pdf,Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences,"Leveraging multi-view diffusion models as priors for 3D optimization have +alleviated the problem of 3D consistency, e.g., the Janus face problem or the +content drift problem, in zero-shot text-to-3D models. However, the 3D +geometric fidelity of the output remains an unresolved issue; albeit the +rendered 2D views are realistic, the underlying geometry may contain errors +such as unreasonable concavities. 
In this work, we propose CorrespondentDream, +an effective method to leverage annotation-free, cross-view correspondences +yielded from the diffusion U-Net to provide additional 3D prior to the NeRF +optimization process. We find that these correspondences are strongly +consistent with human perception, and by adopting it in our loss design, we are +able to produce NeRF models with geometries that are more coherent with common +sense, e.g., more smoothed object surface, yielding higher 3D fidelity. We +demonstrate the efficacy of our approach through various comparative +qualitative results and a solid user study.",cs.CV,['cs.CV'] +Generative Region-Language Pretraining for Open-Ended Object Detection,Chuang Lin · Yi Jiang · Lizhen Qu · Zehuan Yuan · Jianfei Cai, ,https://arxiv.org/abs/2403.10191,,2403.10191.pdf,Generative Region-Language Pretraining for Open-Ended Object Detection,"In recent research, significant attention has been devoted to the +open-vocabulary object detection task, aiming to generalize beyond the limited +number of classes labeled during training and detect objects described by +arbitrary category names at inference. Compared with conventional object +detection, open vocabulary object detection largely extends the object +detection categories. However, it relies on calculating the similarity between +image regions and a set of arbitrary category names with a pretrained +vision-and-language model. This implies that, despite its open-set nature, the +task still needs the predefined object categories during the inference stage. +This raises the question: What if we do not have exact knowledge of object +categories during inference? In this paper, we call such a new setting as +generative open-ended object detection, which is a more general and practical +problem. To address it, we formulate object detection as a generative problem +and propose a simple framework named GenerateU, which can detect dense objects +and generate their names in a free-form way. Particularly, we employ Deformable +DETR as a region proposal generator with a language model translating visual +regions to object names. To assess the free-form object detection task, we +introduce an evaluation method designed to quantitatively measure the +performance of generative outcomes. Extensive experiments demonstrate strong +zero-shot detection performance of our GenerateU. For example, on the LVIS +dataset, our GenerateU achieves comparable results to the open-vocabulary +object detection method GLIP, even though the category names are not seen by +GenerateU during inference. Code is available at: https:// +github.com/FoundationVision/GenerateU .",cs.CV,['cs.CV'] +GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding,Chengyao Wang · Li Jiang · Xiaoyang Wu · Zhuotao Tian · Bohao Peng · Hengshuang Zhao · Jiaya Jia,https://github.com/dvlab-research/GroupContrast,https://arxiv.org/abs/2403.09639,,2403.09639.pdf,GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding,"Self-supervised 3D representation learning aims to learn effective +representations from large-scale unlabeled point clouds. Most existing +approaches adopt point discrimination as the pretext task, which assigns +matched points in two distinct views as positive pairs and unmatched points as +negative pairs. 
However, this approach often results in semantically identical +points having dissimilar representations, leading to a high number of false +negatives and introducing a ""semantic conflict"" problem. To address this issue, +we propose GroupContrast, a novel approach that combines segment grouping and +semantic-aware contrastive learning. Segment grouping partitions points into +semantically meaningful regions, which enhances semantic coherence and provides +semantic guidance for the subsequent contrastive representation learning. +Semantic-aware contrastive learning augments the semantic information extracted +from segment grouping and helps to alleviate the issue of ""semantic conflict"". +We conducted extensive experiments on multiple 3D scene understanding tasks. +The results demonstrate that GroupContrast learns semantically meaningful +representations and achieves promising transfer learning performance.",cs.CV,['cs.CV'] +Improved Visual Grounding through Self-Consistent Explanations,Ruozhen He · Paola Cascante-Bonilla · Ziyan Yang · Alex Berg · Vicente Ordonez,https://catherine-r-he.github.io/SelfEQ/,https://arxiv.org/abs/2312.04554v1,,2312.04554v1.pdf,Improved Visual Grounding through Self-Consistent Explanations,"Vision-and-language models trained to match images with text can be combined +with visual explanation methods to point to the locations of specific objects +in an image. Our work shows that the localization --""grounding""-- abilities of +these models can be further improved by finetuning for self-consistent visual +explanations. We propose a strategy for augmenting existing text-image datasets +with paraphrases using a large language model, and SelfEQ, a weakly-supervised +strategy on visual explanation maps for paraphrases that encourages +self-consistency. Specifically, for an input textual phrase, we attempt to +generate a paraphrase and finetune the model so that the phrase and paraphrase +map to the same region in the image. We posit that this both expands the +vocabulary that the model is able to handle, and improves the quality of the +object locations highlighted by gradient-based visual explanation methods (e.g. +GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, +ReferIt, and RefCOCO+ over a strong baseline method and several prior works. +Particularly, comparing to other methods that do not use any type of box +annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), +67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on +RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on +average).",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution,Shangchen Zhou · Peiqing Yang · Jianyi Wang · Yihang Luo · Chen Change Loy, ,https://arxiv.org/abs/2312.06640,,2312.06640.pdf,Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution,"Text-based diffusion models have exhibited remarkable success in generation +and editing, showing great promise for enhancing visual content with their +generative prior. However, applying these models to video super-resolution +remains challenging due to the high demands for output fidelity and temporal +consistency, which is complicated by the inherent randomness in diffusion +models. Our study introduces Upscale-A-Video, a text-guided latent diffusion +framework for video upscaling. 
This framework ensures temporal coherence +through two key mechanisms: locally, it integrates temporal layers into U-Net +and VAE-Decoder, maintaining consistency within short sequences; globally, +without training, a flow-guided recurrent latent propagation module is +introduced to enhance overall video stability by propagating and fusing latent +across the entire sequences. Thanks to the diffusion paradigm, our model also +offers greater flexibility by allowing text prompts to guide texture creation +and adjustable noise levels to balance restoration and generation, enabling a +trade-off between fidelity and quality. Extensive experiments show that +Upscale-A-Video surpasses existing methods in both synthetic and real-world +benchmarks, as well as in AI-generated videos, showcasing impressive visual +realism and temporal consistency.",cs.CV,['cs.CV'] +Image Neural Field Diffusion Models,Yinbo Chen · Oliver Wang · Richard Zhang · Eli Shechtman · Xiaolong Wang · Michaël Gharbi, ,https://arxiv.org/abs/2310.08337,,2310.08337.pdf,Neural Diffusion Models,"Diffusion models have shown remarkable performance on many generative tasks. +Despite recent success, most diffusion models are restricted in that they only +allow linear transformation of the data distribution. In contrast, broader +family of transformations can potentially help train generative distributions +more efficiently, simplifying the reverse process and closing the gap between +the true negative log-likelihood and the variational approximation. In this +paper, we present Neural Diffusion Models (NDMs), a generalization of +conventional diffusion models that enables defining and learning time-dependent +non-linear transformations of data. We show how to optimise NDMs using a +variational bound in a simulation-free setting. Moreover, we derive a +time-continuous formulation of NDMs, which allows fast and reliable inference +using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the +utility of NDMs with learnable transformations through experiments on standard +image generation benchmarks, including CIFAR-10, downsampled versions of +ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms +of likelihood and produce high-quality samples.",cs.LG,"['cs.LG', 'stat.ML']" +ViTamin: Designing Scalable Vision Models in the Vision-Language Era,Jieneng Chen · Qihang Yu · Xiaohui Shen · Alan L. Yuille · Liang-Chieh Chen, ,https://arxiv.org/abs/2404.02132,,2404.02132.pdf,ViTamin: Designing Scalable Vision Models in the Vision-Language Era,"Recent breakthroughs in vision-language models (VLMs) start a new page in the +vision community. The VLMs provide stronger and more generalizable feature +embeddings compared to those from ImageNet-pretrained models, thanks to the +training on the large-scale Internet image-text pairs. However, despite the +amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain +the default choice for the image encoder. Although pure transformer proves its +effectiveness in the text encoding area, it remains questionable whether it is +also the case for image encoding, especially considering that various types of +networks are proposed on the ImageNet benchmark, which, unfortunately, are +rarely studied in VLMs. Due to small data/model scale, the original conclusions +of model design on ImageNet can be limited and biased. 
In this paper, we aim at +building an evaluation protocol of vision models in the vision-language era +under the contrastive language-image pretraining (CLIP) framework. We provide a +comprehensive way to benchmark different vision models, covering their +zero-shot performance and scalability in both model and training data sizes. To +this end, we introduce ViTamin, a new vision models tailored for VLMs. +ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy, +when using the same publicly available DataComp-1B dataset and the same +OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse +benchmarks, including classification, retrieval, open-vocabulary detection and +segmentation, and large multi-modal models. When further scaling up the model +size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot +accuracy, surpassing 82.0% achieved by EVA-E that has ten times more parameters +(4.4B).",cs.CV,['cs.CV'] +Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery,Mubashir Noman · Muzammal Naseer · Hisham Cholakkal · Rao Anwer · Salman Khan · Fahad Shahbaz Khan, ,https://web3.arxiv.org/abs/2403.05419,,2403.05419.pdf,Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery,"Recent advances in unsupervised learning have demonstrated the ability of +large vision models to achieve promising results on downstream tasks by +pre-training on large amount of unlabelled data. Such pre-training techniques +have also been explored recently in the remote sensing domain due to the +availability of large amount of unlabelled data. Different from standard +natural image datasets, remote sensing data is acquired from various sensor +technologies and exhibit diverse range of scale variations as well as +modalities. Existing satellite image pre-training methods either ignore the +scale information present in the remote sensing imagery or restrict themselves +to use only a single type of data modality. In this paper, we re-visit +transformers pre-training and leverage multi-scale information that is +effectively utilized with multiple modalities. Our proposed approach, named +SatMAE++, performs multi-scale pre-training and utilizes convolution based +upsampling blocks to reconstruct the image at higher scales making it +extensible to include more scales. Compared to existing works, the proposed +SatMAE++ with multi-scale pre-training is equally effective for both optical as +well as multi-spectral imagery. Extensive experiments on six datasets reveal +the merits of proposed contributions, leading to state-of-the-art performance +on all datasets. SatMAE++ achieves mean average precision (mAP) gain of 2.5\% +for multi-label classification task on BigEarthNet dataset. Our code and +pre-trained models are available at \url{https://github.com/techmn/satmae_pp}.",cs.CV,['cs.CV'] +RCL: Reliable Continual Learning for Unified Failure Detection,Fei Zhu · Zhen Cheng · Xu-Yao Zhang · Cheng-Lin Liu · Zhaoxiang Zhang, ,https://arxiv.org/abs/2403.02886,,2403.02886.pdf,Revisiting Confidence Estimation: Towards Reliable Failure Prediction,"Reliable confidence estimation is a challenging yet fundamental requirement +in many risk-sensitive applications. However, modern deep neural networks are +often overconfident for their incorrect predictions, i.e., misclassified +samples from known classes, and out-of-distribution (OOD) samples from unknown +classes. 
In recent years, many confidence calibration and OOD detection methods +have been developed. In this paper, we find a general, widely existing but +actually-neglected phenomenon that most confidence estimation methods are +harmful for detecting misclassification errors. We investigate this problem and +reveal that popular calibration and OOD detection methods often lead to worse +confidence separation between correctly classified and misclassified examples, +making it difficult to decide whether to trust a prediction or not. Finally, we +propose to enlarge the confidence gap by finding flat minima, which yields +state-of-the-art failure prediction performance under various settings +including balanced, long-tailed, and covariate-shift classification scenarios. +Our study not only provides a strong baseline for reliable confidence +estimation but also acts as a bridge between understanding calibration, OOD +detection, and failure prediction. The code is available at +\url{https://github.com/Impression2805/FMFP}.",cs.CV,"['cs.CV', 'cs.LG']" +Boosting Neural Representations for Videos with a Conditional Decoder,XINJIE ZHANG · Ren Yang · Dailan He · Xingtong Ge · Tongda Xu · Yan Wang · Hongwei Qin · Jun Zhang,https://github.com/Xinjie-Q/Boosting-NeRV,https://arxiv.org/abs/2402.18152,,2402.18152.pdf,Boosting Neural Representations for Videos with a Conditional Decoder,"Implicit neural representations (INRs) have emerged as a promising approach +for video storage and processing, showing remarkable versatility across various +video tasks. However, existing methods often fail to fully leverage their +representation capabilities, primarily due to inadequate alignment of +intermediate features during target frame decoding. This paper introduces a +universal boosting framework for current implicit video representation +approaches. Specifically, we utilize a conditional decoder with a +temporal-aware affine transform module, which uses the frame index as a prior +condition to effectively align intermediate features with target frames. +Besides, we introduce a sinusoidal NeRV-like block to generate diverse +intermediate features and achieve a more balanced parameter distribution, +thereby enhancing the model's capacity. With a high-frequency +information-preserving reconstruction loss, our approach successfully boosts +multiple baseline INRs in the reconstruction quality and convergence speed for +video regression, and exhibits superior inpainting and interpolation results. +Further, we integrate a consistent entropy minimization technique and develop +video codecs based on these boosted INRs. Experiments on the UVG dataset +confirm that our enhanced codecs significantly outperform baseline INRs and +offer competitive rate-distortion performance compared to traditional and +learning-based codecs. Code is available at +https://github.com/Xinjie-Q/Boosting-NeRV.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +Uncertainty Visualization via Low-Dimensional Posterior Projections,Omer Yair · Tomer Michaeli · Elias Nehme, ,https://arxiv.org/abs/2312.07804,,2312.07804.pdf,Uncertainty Visualization via Low-Dimensional Posterior Projections,"In ill-posed inverse problems, it is commonly desirable to obtain insight +into the full spectrum of plausible solutions, rather than extracting only a +single reconstruction. Information about the plausible solutions and their +likelihoods is encoded in the posterior distribution. However, for +high-dimensional data, this distribution is challenging to visualize. 
In this +work, we introduce a new approach for estimating and visualizing posteriors by +employing energy-based models (EBMs) over low-dimensional subspaces. +Specifically, we train a conditional EBM that receives an input measurement and +a set of directions that span some low-dimensional subspace of solutions, and +outputs the probability density function of the posterior within that space. We +demonstrate the effectiveness of our method across a diverse range of datasets +and image restoration problems, showcasing its strength in uncertainty +quantification and visualization. As we show, our method outperforms a baseline +that projects samples from a diffusion-based posterior sampler, while being +orders of magnitude faster. Furthermore, it is more accurate than a baseline +that assumes a Gaussian posterior.",cs.CV,['cs.CV'] +ElasticDiffusion: Training-free Arbitrary Size Image Generation,Moayed Haji Ali · Guha Balakrishnan · Vicente Ordonez, ,https://arxiv.org/abs/2311.18822,,2311.18822.pdf,ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation,"Diffusion models have revolutionized image generation in recent years, yet +they are still limited to a few sizes and aspect ratios. We propose +ElasticDiffusion, a novel training-free decoding method that enables pretrained +text-to-image diffusion models to generate images with various sizes. +ElasticDiffusion attempts to decouple the generation trajectory of a pretrained +model into local and global signals. The local signal controls low-level pixel +information and can be estimated on local patches, while the global signal is +used to maintain overall structural consistency and is estimated with a +reference image. We test our method on CelebA-HQ (faces) and LAION-COCO +(objects/indoor/outdoor scenes). Our experiments and qualitative results show +superior image coherence quality across aspect ratios compared to +MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project +page: https://elasticdiffusion.github.io/",cs.CV,['cs.CV'] +Exploiting Diffusion Prior for Generalizable Dense Prediction,Hsin-Ying Lee · Hung-Yu Tseng · Hsin-Ying Lee · Ming-Hsuan Yang,https://shinying.github.io/dmp,https://arxiv.org/abs/2311.18832,,2311.18832.pdf,Exploiting Diffusion Prior for Generalizable Dense Prediction,"Contents generated by recent advanced Text-to-Image (T2I) diffusion models +are sometimes too imaginative for existing off-the-shelf dense predictors to +estimate due to the immitigable domain gap. We introduce DMP, a pipeline +utilizing pre-trained T2I models as a prior for dense prediction tasks. To +address the misalignment between deterministic prediction tasks and stochastic +T2I models, we reformulate the diffusion process through a sequence of +interpolations, establishing a deterministic mapping between input RGB images +and output prediction distributions. To preserve generalizability, we use +low-rank adaptation to fine-tune pre-trained models. Extensive experiments +across five tasks, including 3D property estimation, semantic segmentation, and +intrinsic image decomposition, showcase the efficacy of the proposed method. 
+Despite limited-domain training data, the approach yields faithful estimations +for arbitrary images, surpassing existing state-of-the-art algorithms.",cs.CV,['cs.CV'] +Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text,Junshu Tang · Yanhong Zeng · Ke Fan · Xuheng Wang · Bo Dai · Kai Chen · Lizhuang Ma, ,https://arxiv.org/abs/2403.16897,,2403.16897.pdf,Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text,"Creating and animating 3D biped cartoon characters is crucial and valuable in +various applications. Compared with geometry, the diverse texture design plays +an important role in making 3D biped cartoon characters vivid and charming. +Therefore, we focus on automatic texture design for cartoon characters based on +input instructions. This is challenging for domain-specific requirements and a +lack of high-quality data. To address this challenge, we propose Make-It-Vivid, +the first attempt to enable high-quality texture generation from text in UV +space. We prepare a detailed text-texture paired data for 3D characters by +using vision-question-answering agents. Then we customize a pretrained +text-to-image model to generate texture map with template structure while +preserving the natural 2D image knowledge. Furthermore, to enhance fine-grained +details, we propose a novel adversarial learning scheme to shorten the domain +gap between original dataset and realistic texture domain. Extensive +experiments show that our approach outperforms current texture generation +methods, resulting in efficient character texturing and faithful generation +with prompts. Besides, we showcase various applications such as out of domain +generation and texture stylization. We also provide an efficient generation +system for automatic text-guided textured character generation and animation.",cs.CV,['cs.CV'] +Eclipse: Disambiguating Illumination and Materials using Unintended Shadows,Dor Verbin · Ben Mildenhall · Peter Hedman · Jonathan T. Barron · Todd Zickler · Pratul P. Srinivasan, ,,https://www.youtube.com/watch?v=amQLGyza3EU,,,,,nan +Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection,Zhiyuan Yan · Yuhao Luo · Siwei Lyu · Qingshan Liu · Baoyuan Wu, ,https://arxiv.org/abs/2311.11278v1,,2311.11278v1.pdf,Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection,"Deepfake detection faces a critical generalization hurdle, with performance +deteriorating when there is a mismatch between the distributions of training +and testing data. A broadly received explanation is the tendency of these +detectors to be overfitted to forgery-specific artifacts, rather than learning +features that are widely applicable across various forgeries. To address this +issue, we propose a simple yet effective detector called LSDA +(\underline{L}atent \underline{S}pace \underline{D}ata +\underline{A}ugmentation), which is based on a heuristic idea: representations +with a wider variety of forgeries should be able to learn a more generalizable +decision boundary, thereby mitigating the overfitting of method-specific +features (see Figure. 1). Following this idea, we propose to enlarge the +forgery space by constructing and simulating variations within and across +forgery features in the latent space. This approach encompasses the acquisition +of enriched, domain-specific features and the facilitation of smoother +transitions between different forgery types, effectively bridging domain gaps. 
+Our approach culminates in refining a binary classifier that leverages the +distilled knowledge from the enhanced features, striving for a generalizable +deepfake detector. Comprehensive experiments show that our proposed method is +surprisingly effective and transcends state-of-the-art detectors across several +widely used benchmarks.",cs.CV,['cs.CV'] +Revisiting Sampson Approximations for Geometric Estimation Problems,Felix Rydell · Angelica Torres · Viktor Larsson, ,https://arxiv.org/abs/2401.07114,,2401.07114.pdf,Revisiting Sampson Approximations for Geometric Estimation Problems,"Many problems in computer vision can be formulated as geometric estimation +problems, i.e. given a collection of measurements (e.g. point correspondences) +we wish to fit a model (e.g. an essential matrix) that agrees with our +observations. This necessitates some measure of how much an observation +``agrees"" with a given model. A natural choice is to consider the smallest +perturbation that makes the observation exactly satisfy the constraints. +However, for many problems, this metric is expensive or otherwise intractable +to compute. The so-called Sampson error approximates this geometric error +through a linearization scheme. For epipolar geometry, the Sampson error is a +popular choice and in practice known to yield very tight approximations of the +corresponding geometric residual (the reprojection error). + In this paper we revisit the Sampson approximation and provide new +theoretical insights as to why and when this approximation works, as well as +provide explicit bounds on the tightness under some mild assumptions. Our +theoretical results are validated in several experiments on real data and in +the context of different geometric estimation tasks.",cs.CV,"['cs.CV', 'math.AG', '68T45 (Primary), 14Q99 (Secondary), 68W30']" +Pick-or-Mix: Dynamic Channel Sampling for ConvNets,Ashish Kumar · Daneul Kim · Jaesik Park · Laxmidhar Behera, ,,https://openreview.net/forum?id=Howb7fXB4V,,,,,nan +FreePoint: Unsupervised Point Cloud Instance Segmentation,Zhikai Zhang · Jian Ding · Li Jiang · Dengxin Dai · Gui-Song Xia, ,,https://medium.com/forestree/reviewing-unsupervised-semantic-segmentation-methods-for-point-cloud-a50a508f7f88,,,,,nan +Mind marginal non-crack regions: Clustering-inspired representation learning for crack segmentation,zhuangzhuang chen · Zhuonan Lai · Jie Chen · Jianqiang Li, ,https://arxiv.org/html/2403.03063v1,,2403.03063v1.pdf,CrackNex: a Few-shot Low-light Crack Segmentation Model Based on Retinex Theory for UAV Inspections,"Routine visual inspections of concrete structures are imperative for +upholding the safety and integrity of critical infrastructure. Such visual +inspections sometimes happen under low-light conditions, e.g., checking for +bridge health. Crack segmentation under such conditions is challenging due to +the poor contrast between cracks and their surroundings. However, most deep +learning methods are designed for well-illuminated crack images and hence their +performance drops dramatically in low-light scenes. In addition, conventional +approaches require many annotated low-light crack images which is +time-consuming. In this paper, we address these challenges by proposing +CrackNex, a framework that utilizes reflectance information based on Retinex +Theory to help the model learn a unified illumination-invariant representation. +Furthermore, we utilize few-shot segmentation to solve the inefficient training +data problem. 
In CrackNex, both a support prototype and a reflectance prototype +are extracted from the support set. Then, a prototype fusion module is designed +to integrate the features from both prototypes. CrackNex outperforms the SOTA +methods on multiple datasets. Additionally, we present the first benchmark +dataset, LCSD, for low-light crack segmentation. LCSD consists of 102 +well-illuminated crack images and 41 low-light crack images. The dataset and +code are available at https://github.com/zy1296/CrackNex.",cs.CV,['cs.CV'] +MV-Adapter: Exploring Parameter Efficient Learning for Video Text Retrieval,bowen zhang · Xiaojie Jin · Weibo Gong · Kai Xu · Xueqing Deng · Peng Wang · Zhao Zhang · Xiaohui Shen · Jiashi Feng, ,https://arxiv.org/abs/2405.19465,,2405.19465.pdf,RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter,"Text-Video Retrieval (TVR) aims to align relevant video content with natural +language queries. To date, most state-of-the-art TVR methods learn +image-to-video transfer learning based on large-scale pre-trained +visionlanguage models (e.g., CLIP). However, fully fine-tuning these +pre-trained models for TVR incurs prohibitively expensive computation costs. To +this end, we propose to conduct efficient text-video Retrieval with a +sparse-andcorrelated AdaPter (RAP), i.e., fine-tuning the pre-trained model +with a few parameterized layers. To accommodate the text-video scenario, we +equip our RAP with two indispensable characteristics: temporal sparsity and +correlation. Specifically, we propose a low-rank modulation module to refine +the per-image features from the frozen CLIP backbone, which accentuates salient +frames within the video features while alleviating temporal redundancy. +Besides, we introduce an asynchronous self-attention mechanism that first +selects the top responsive visual patches and augments the correlation modeling +between them with learnable temporal and patch offsets. Extensive experiments +on four TVR datasets demonstrate that RAP achieves superior or comparable +performance compared to the fully fine-tuned counterpart and other +parameter-efficient fine-tuning methods.",cs.CV,['cs.CV'] +Few-shot Learner Parameterization by Diffusion Time-steps,Zhongqi Yue · Pan Zhou · Richang Hong · Hanwang Zhang · Qianru Sun, ,https://arxiv.org/abs/2403.02649,,2403.02649.pdf,Few-shot Learner Parameterization by Diffusion Time-steps,"Even when using large multi-modal foundation models, few-shot learning is +still challenging -- if there is no proper inductive bias, it is nearly +impossible to keep the nuanced class attributes while removing the visually +prominent attributes that spuriously correlate with class labels. To this end, +we find an inductive bias that the time-steps of a Diffusion Model (DM) can +isolate the nuanced class attributes, i.e., as the forward diffusion adds noise +to an image at each time-step, nuanced attributes are usually lost at an +earlier time-step than the spurious attributes that are visually prominent. +Building on this, we propose Time-step Few-shot (TiF) learner. We train +class-specific low-rank adapters for a text-conditioned DM to make up for the +lost attributes, such that images can be accurately reconstructed from their +noisy ones given a prompt. Hence, at a small time-step, the adapter and prompt +are essentially a parameterization of only the nuanced class attributes. For a +test image, we can use the parameterization to only extract the nuanced class +attributes for classification. 
TiF learner significantly outperforms OpenCLIP +and its adapters on a variety of fine-grained and customized few-shot learning +tasks. Codes are in https://github.com/yue-zhongqi/tif.",cs.CV,['cs.CV'] +"SPIN: Simultaneous Perception, Interaction and Navigation",Shagun Uppal · Ananye Agarwal · Haoyu Xiong · Kenneth Shaw · Deepak Pathak, ,https://arxiv.org/abs/2405.07991,,2405.07991.pdf,"SPIN: Simultaneous Perception, Interaction and Navigation","While there has been remarkable progress recently in the fields of +manipulation and locomotion, mobile manipulation remains a long-standing +challenge. Compared to locomotion or static manipulation, a mobile system must +make a diverse range of long-horizon tasks feasible in unstructured and dynamic +environments. While the applications are broad and interesting, there are a +plethora of challenges in developing these systems such as coordination between +the base and arm, reliance on onboard perception for perceiving and interacting +with the environment, and most importantly, simultaneously integrating all +these parts together. Prior works approach the problem using disentangled +modular skills for mobility and manipulation that are trivially tied together. +This causes several limitations such as compounding errors, delays in +decision-making, and no whole-body coordination. In this work, we present a +reactive mobile manipulation framework that uses an active visual system to +consciously perceive and react to its environment. Similar to how humans +leverage whole-body and hand-eye coordination, we develop a mobile manipulator +that exploits its ability to move and see, more specifically -- to move in +order to see and to see in order to move. This allows it to not only move +around and interact with its environment but also, choose ""when"" to perceive +""what"" using an active visual system. We observe that such an agent learns to +navigate around complex cluttered scenarios while displaying agile whole-body +coordination using only ego-vision without needing to create environment maps. +Results visualizations and videos at https://spin-robot.github.io/",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG', 'cs.SY', 'eess.SY']" +Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery,Siddharth Tourani · Ahmed Alwheibi · Arif Mahmood · Muhammad Haris Khan, ,https://arxiv.org/abs/2403.16194,,2403.16194.pdf,Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery,"Unsupervised landmarks discovery (ULD) for an object category is a +challenging computer vision problem. In pursuit of developing a robust ULD +framework, we explore the potential of a recent paradigm of self-supervised +learning algorithms, known as diffusion models. Some recent works have shown +that these models implicitly contain important correspondence cues. Towards +harnessing the potential of diffusion models for the ULD task, we make the +following core contributions. First, we propose a ZeroShot ULD baseline based +on simple clustering of random pixel locations with nearest neighbour matching. +It delivers better results than existing ULD methods. Second, motivated by the +ZeroShot performance, we develop a ULD algorithm based on diffusion features +using self-training and clustering which also outperforms prior methods by +notable margins. 
Third, we introduce a new proxy task based on generating +latent pose codes and also propose a two-stage clustering mechanism to +facilitate effective pseudo-labeling, resulting in a significant performance +improvement. Overall, our approach consistently outperforms state-of-the-art +methods on four challenging benchmarks AFLW, MAFL, CatHeads and LS3D by +significant margins.",cs.CV,['cs.CV'] +Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency,Yuqi Zhang · Han Luo · Yinjie Lei, ,https://arxiv.org/abs/2311.15383,,2311.15383.pdf,Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,"3D Visual Grounding (3DVG) aims at localizing 3D object based on textual +descriptions. Conventional supervised methods for 3DVG often necessitate +extensive annotations and a predefined vocabulary, which can be restrictive. To +address this issue, we propose a novel visual programming approach for +zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language +models (LLMs). Our approach begins with a unique dialog-based method, engaging +with LLMs to establish a foundational understanding of zero-shot 3DVG. Building +on this, we design a visual program that consists of three types of modules, +i.e., view-independent, view-dependent, and functional modules. These modules, +specifically tailored for 3D scenarios, work collaboratively to perform complex +reasoning and inference. Furthermore, we develop an innovative language-object +correlation module to extend the scope of existing 3D object detectors into +open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot +approach can outperform some supervised baselines, marking a significant stride +towards effective 3DVG.",cs.CV,['cs.CV'] +POPDG: Popular 3D Dance Generation with PopDanceSet,Zhenye Luo · Min Ren · Xuecai Hu · Yongzhen Huang · Li Yao, ,https://arxiv.org/abs/2405.03178,,2405.03178.pdf,POPDG: Popular 3D Dance Generation with PopDanceSet,"Generating dances that are both lifelike and well-aligned with music +continues to be a challenging task in the cross-modal domain. This paper +introduces PopDanceSet, the first dataset tailored to the preferences of young +audiences, enabling the generation of aesthetically oriented dances. And it +surpasses the AIST++ dataset in music genre diversity and the intricacy and +depth of dance movements. Moreover, the proposed POPDG model within the iDDPM +framework enhances dance diversity and, through the Space Augmentation +Algorithm, strengthens spatial physical connections between human body joints, +ensuring that increased diversity does not compromise generation quality. A +streamlined Alignment Module is also designed to improve the temporal alignment +between dance and music. Extensive experiments show that POPDG achieves SOTA +results on two datasets. Furthermore, the paper also expands on current +evaluation metrics. The dataset and code are available at +https://github.com/Luke-Luo1/POPDG.",cs.SD,"['cs.SD', 'eess.AS']" +CLiC: Concept Learning in Context,Mehdi Safaee · Aryan Mikaeili · Or Patashnik · Daniel Cohen-Or · Ali Mahdavi Amiri, ,https://arxiv.org/abs/2311.17083,,2311.17083.pdf,CLiC: Concept Learning in Context,"This paper addresses the challenge of learning a local visual pattern of an +object from one image, and generating images depicting objects with that +pattern. 
Learning a localized concept and placing it on an object in a target +image is a nontrivial task, as the objects may have different orientations and +shapes. Our approach builds upon recent advancements in visual concept +learning. It involves acquiring a visual concept (e.g., an ornament) from a +source image and subsequently applying it to an object (e.g., a chair) in a +target image. Our key idea is to perform in-context concept learning, acquiring +the local visual concept within the broader context of the objects they belong +to. To localize the concept learning, we employ soft masks that contain both +the concept within the mask and the surrounding image area. We demonstrate our +approach through object generation within an image, showcasing plausible +embedding of in-context learned concepts. We also introduce methods for +directing acquired concepts to specific locations within target images, +employing cross-attention mechanisms, and establishing correspondences between +source and target objects. The effectiveness of our method is demonstrated +through quantitative and qualitative experiments, along with comparisons +against baseline techniques.",cs.CV,['cs.CV'] +Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing,Xun Lin · Shuai Wang · RIZHAO CAI · Yizhong Liu · Ying Fu · Wenzhong Tang · Zitong YU · Alex C. Kot, ,https://arxiv.org/abs/2402.19298,,2402.19298.pdf,Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing,"Face Anti-Spoofing (FAS) is crucial for securing face recognition systems +against presentation attacks. With advancements in sensor manufacture and +multi-modal learning techniques, many multi-modal FAS approaches have emerged. +However, they face challenges in generalizing to unseen attacks and deployment +conditions. These challenges arise from (1) modality unreliability, where some +modality sensors like depth and infrared undergo significant domain shifts in +varying environments, leading to the spread of unreliable information during +cross-modal feature fusion, and (2) modality imbalance, where training overly +relies on a dominant modality hinders the convergence of others, reducing +effectiveness against attack types that are indistinguishable sorely using the +dominant modality. To address modality unreliability, we propose the +Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected +regions within each modality and suppress the impact of unreliable regions on +other modalities. For modality imbalance, we propose a Rebalanced Modality +Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all +modalities by adaptively adjusting their gradients. Besides, we provide the +first large-scale benchmark for evaluating multi-modal FAS performance under +domain generalization scenarios. Extensive experiments demonstrate that our +method outperforms state-of-the-art methods. Source code and protocols will be +released on https://github.com/OMGGGGG/mmdg.",cs.CV,['cs.CV'] +Alchemist: Parametric Control of Material Properties with Diffusion Models,Prafull Sharma · Varun Jampani · Yuanzhen Li · Xuhui Jia · Dmitry Lagun · Fredo Durand · William Freeman · Mark Matthews, ,https://arxiv.org/abs/2312.02970,,2312.02970.pdf,Alchemist: Parametric Control of Material Properties with Diffusion Models,"We propose a method to control material attributes of objects like roughness, +metallic, albedo, and transparency in real images. 
Our method capitalizes on +the generative prior of text-to-image models known for photorealism, employing +a scalar value and instructions to alter low-level material properties. +Addressing the lack of datasets with controlled material attributes, we +generated an object-centric synthetic dataset with physically-based materials. +Fine-tuning a modified pre-trained text-to-image model on this synthetic +dataset enables us to edit material properties in real-world images while +preserving all other attributes. We show the potential application of our model +to material edited NeRFs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Noisy One-point Homographies are Surprisingly Good,Yaqing Ding · Jonathan Astermark · Magnus Oskarsson · Viktor Larsson, ,,https://vlarsson.github.io/publications/,,,,,nan +Small Scale Data-Free Knowledge Distillation,He Liu · Yikai Wang · Huaping Liu · Fuchun Sun · Anbang Yao, ,https://arxiv.org/abs/2403.19539,,2403.19539.pdf,De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,"Data-Free Knowledge Distillation (DFKD) is a promising task to train +high-performance small models to enhance actual deployment without relying on +the original training data. Existing methods commonly avoid relying on private +data by utilizing synthetic or sampled data. However, a long-overlooked issue +is that the severe distribution shifts between their substitution and original +data, which manifests as huge differences in the quality of images and class +proportions. The harmful shifts are essentially the confounder that +significantly causes performance bottlenecks. To tackle the issue, this paper +proposes a novel perspective with causal inference to disentangle the student +models from the impact of such shifts. By designing a customized causal graph, +we first reveal the causalities among the variables in the DFKD task. +Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI) +framework based on the backdoor adjustment to de-confound the confounder. KDCI +can be flexibly combined with most existing state-of-the-art baselines. +Experiments in combination with six representative DFKD methods demonstrate the +effectiveness of our KDCI, which can obviously help existing methods under +almost all settings, \textit{e.g.}, improving the baseline by up to 15.54\% +accuracy on the CIFAR-100 dataset.",cs.CV,['cs.CV'] +Efficient Multitask Dense Predictor via Binarization,Yuzhang Shang · Dan Xu · Gaowen Liu · Ramana Kompella · Yan Yan, ,https://arxiv.org/abs/2405.14136,,2405.14136.pdf,Efficient Multitask Dense Predictor via Binarization,"Multi-task learning for dense prediction has emerged as a pivotal area in +computer vision, enabling simultaneous processing of diverse yet interrelated +pixel-wise prediction tasks. However, the substantial computational demands of +state-of-the-art (SoTA) models often limit their widespread deployment. This +paper addresses this challenge by introducing network binarization to compress +resource-intensive multi-task dense predictors. Specifically, our goal is to +significantly accelerate multi-task dense prediction models via Binary Neural +Networks (BNNs) while maintaining and even improving model performance at the +same time. To reach this goal, we propose a Binary Multi-task Dense Predictor, +Bi-MTDP, and several variants of Bi-MTDP, in which a multi-task dense predictor +is constructed via specified binarized modules. 
Our systematical analysis of +this predictor reveals that performance drop from binarization is primarily +caused by severe information degradation. To address this issue, we introduce a +deep information bottleneck layer that enforces representations for downstream +tasks satisfying Gaussian distribution in forward propagation. Moreover, we +introduce a knowledge distillation mechanism to correct the direction of +information flow in backward propagation. Intriguingly, one variant of Bi-MTDP +outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC +(CNN-based) and InvPT (ViT-Based). This result indicates that Bi-MTDP is not +merely a naive trade-off between performance and efficiency, but is rather a +benefit of the redundant information flow thanks to the multi-task +architecture. Code is available at https://github.com/42Shawn/BiMTDP.",cs.CV,['cs.CV'] +Neural Super-Resolution for Real-time Rendering with Radiance Demodulation,Jia Li · Ziling Chen · Xiaolong Wu · Lu Wang · Beibei Wang · Lei Zhang, ,https://arxiv.org/abs/2308.06699,,2308.06699.pdf,Neural Super-Resolution for Real-time Rendering with Radiance Demodulation,"It is time-consuming to render high-resolution images in applications such as +video games and virtual reality, and thus super-resolution technologies become +increasingly popular for real-time rendering. However, it is challenging to +preserve sharp texture details, keep the temporal stability and avoid the +ghosting artifacts in real-time super-resolution rendering. To address this +issue, we introduce radiance demodulation to separate the rendered image or +radiance into a lighting component and a material component, considering the +fact that the light component is smoother than the rendered image so that the +high-resolution material component with detailed textures can be easily +obtained. We perform the super-resolution on the lighting component only and +re-modulate it with the high-resolution material component to obtain the final +super-resolution image with more texture details. A reliable warping module is +proposed by explicitly marking the occluded regions to avoid the ghosting +artifacts. To further enhance the temporal stability, we design a +frame-recurrent neural network and a temporal loss to aggregate the previous +and current frames, which can better capture the spatial-temporal consistency +among reconstructed frames. As a result, our method is able to produce +temporally stable results in real-time rendering with high-quality details, +even in the challenging 4 $\times$ 4 super-resolution scenarios.",cs.GR,['cs.GR'] +Multiple View Geometry Transformers for 3D Human Pose Estimation,Ziwei Liao · jialiang zhu · Chunyu Wang · Han Hu · Steven L. Waslander, ,https://arxiv.org/abs/2311.10983,,2311.10983.pdf,Multiple View Geometry Transformers for 3D Human Pose Estimation,"In this work, we aim to improve the 3D reasoning ability of Transformers in +multi-view 3D human pose estimation. Recent works have focused on end-to-end +learning-based transformer designs, which struggle to resolve geometric +information accurately, particularly during occlusion. Instead, we propose a +novel hybrid model, MVGFormer, which has a series of geometric and appearance +modules organized in an iterative manner. The geometry modules are +learning-free and handle all viewpoint-dependent 3D tasks geometrically which +notably improves the model's generalization ability. 
The appearance modules are +learnable and are dedicated to estimating 2D poses from image signals +end-to-end which enables them to achieve accurate estimates even when occlusion +occurs, leading to a model that is both accurate and generalizable to new +cameras and geometries. We evaluate our approach for both in-domain and +out-of-domain settings, where our model consistently outperforms +state-of-the-art methods, and especially does so by a significant margin in the +out-of-domain setting. We will release the code and models: +https://github.com/XunshanMan/MVGFormer.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Efficient Scene Recovery Using Luminous Flux Prior,ZhongYu Li · Lei Zhang, ,,,,,,,nan +ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe,Yifan Bai · Zeyang Zhao · Yihong Gong · Xing Wei, ,https://arxiv.org/abs/2312.17133,,2312.17133.pdf,ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe,"We present ARTrackV2, which integrates two pivotal aspects of tracking: +determining where to look (localization) and how to describe (appearance +analysis) the target object across video frames. Building on the foundation of +its predecessor, ARTrackV2 extends the concept by introducing a unified +generative framework to ""read out"" object's trajectory and ""retell"" its +appearance in an autoregressive manner. This approach fosters a time-continuous +methodology that models the joint evolution of motion and visual features, +guided by previous estimates. Furthermore, ARTrackV2 stands out for its +efficiency and simplicity, obviating the less efficient intra-frame +autoregression and hand-tuned parameters for appearance updates. Despite its +simplicity, ARTrackV2 achieves state-of-the-art performance on prevailing +benchmark datasets while demonstrating remarkable efficiency improvement. In +particular, ARTrackV2 achieves AO score of 79.5\% on GOT-10k, and AUC of 86.1\% +on TrackingNet while being $3.6 \times$ faster than ARTrack. The code will be +released.",cs.CV,['cs.CV'] +ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion,Jiayu Yang · Ziang Cheng · Yunfei Duan · Pan Ji · Hongdong Li, ,https://arxiv.org/abs/2310.10343,,2310.10343.pdf,ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion,"Given a single image of a 3D object, this paper proposes a novel method +(named ConsistNet) that is able to generate multiple images of the same object, +as if seen they are captured from different viewpoints, while the 3D +(multi-view) consistencies among those multiple generated images are +effectively exploited. Central to our method is a multi-view consistency block +which enables information exchange across multiple single-view diffusion +processes based on the underlying multi-view geometry principles. ConsistNet is +an extension to the standard latent diffusion model, and consists of two +sub-modules: (a) a view aggregation module that unprojects multi-view features +into global 3D volumes and infer consistency, and (b) a ray aggregation module +that samples and aggregate 3D consistent features back to each view to enforce +consistency. Our approach departs from previous methods in multi-view image +generation, in that it can be easily dropped-in pre-trained LDMs without +requiring explicit pixel correspondences or depth prediction. Experiments show +that our method effectively learns 3D consistency over a frozen Zero123 +backbone and can generate 16 surrounding views of the object within 40 seconds +on a single A100 GPU. 
Our code will be made available on +https://github.com/JiayuYANG/ConsistNet",cs.CV,['cs.CV'] +Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction,Junuk Cha · Jihyeon Kim · Jae Shin Yoon · Seungryul Baek, ,https://arxiv.org/abs/2404.00562,,2404.00562.pdf,Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction,"This paper introduces the first text-guided work for generating the sequence +of hand-object interaction in 3D. The main challenge arises from the lack of +labeled data where existing ground-truth datasets are nowhere near +generalizable in interaction type and object category, which inhibits the +modeling of diverse 3D hand-object interaction with the correct physical +implication (e.g., contacts and semantics) from text prompts. To address this +challenge, we propose to decompose the interaction generation task into two +subtasks: hand-object contact generation; and hand-object motion generation. +For contact generation, a VAE-based network takes as input a text and an object +mesh, and generates the probability of contacts between the surfaces of hands +and the object during the interaction. The network learns a variety of local +geometry structure of diverse objects that is independent of the objects' +category, and thus, it is applicable to general objects. For motion generation, +a Transformer-based diffusion model utilizes this 3D contact map as a strong +prior for generating physically plausible hand-object motion as a function of +text prompts by learning from the augmented labeled dataset; where we annotate +text labels from many existing 3D hand and object motion data. Finally, we +further introduce a hand refiner module that minimizes the distance between the +object surface and hand joints to improve the temporal stability of the +object-hand contacts and to suppress the penetration artifacts. In the +experiments, we demonstrate that our method can generate more realistic and +diverse interactions compared to other baseline methods. We also show that our +method is applicable to unseen objects. We will release our model and newly +labeled data as a strong foundation for future research. Codes and data are +available in: https://github.com/JunukCha/Text2HOI.",cs.CV,['cs.CV'] +Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image,Yiqun Mei · Yu Zeng · He Zhang · Zhixin Shu · Xuaner Zhang · Sai Bi · Jianming Zhang · HyunJoon Jung · Vishal M. Patel,https://yiqunmei.net/holo-web/,https://arxiv.org/abs/2403.09632,,2403.09632.pdf,Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image,"At the core of portrait photography is the search for ideal lighting and +viewpoint. The process often requires advanced knowledge in photography and an +elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric +relighting method that is capable of synthesizing novel viewpoints, and novel +lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN +(EG3D) to reconstruct geometry and appearance from an input portrait as a set +of 3D-aware features. We design a relighting module conditioned on a given +lighting to process these features, and predict a relit 3D representation in +the form of a tri-plane, which can render to an arbitrary viewpoint through +volume rendering. Besides viewpoint and lighting control, Holo-Relighting also +takes the head pose as a condition to enable head-pose-dependent lighting +effects. 
With these novel designs, Holo-Relighting can generate complex +non-Lambertian lighting effects (e.g., specular highlights and cast shadows) +without using any explicit physical lighting priors. We train Holo-Relighting +with data captured with a light stage, and propose two data-rendering +techniques to improve the data quality for training the volumetric relighting +system. Through quantitative and qualitative experiments, we demonstrate +Holo-Relighting can achieve state-of-the-arts relighting quality with better +photorealism, 3D consistency and controllability.",cs.CV,['cs.CV'] +Uncertainty-Guided Never-Ending Learning to Drive,Lei Lai · Eshed Ohn-Bar · Sanjay Arora · John Yi, ,,https://paperswithcode.com/paper/learning-to-drive-anywhere,,,,,nan +Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation,Lin Long · Haobo Wang · Zhijie Jiang · Lei Feng · Chang Yao · Gang Chen · Junbo Zhao, ,https://arxiv.org/abs/2307.15973,,2307.15973.pdf,Debiased Pairwise Learning from Positive-Unlabeled Implicit Feedback,"Learning contrastive representations from pairwise comparisons has achieved +remarkable success in various fields, such as natural language processing, +computer vision, and information retrieval. Collaborative filtering algorithms +based on pairwise learning also rooted in this paradigm. A significant concern +is the absence of labels for negative instances in implicit feedback data, +which often results in the random selected negative instances contains false +negatives and inevitably, biased embeddings. To address this issue, we +introduce a novel correction method for sampling bias that yields a modified +loss for pairwise learning called debiased pairwise loss (DPL). The key idea +underlying DPL is to correct the biased probability estimates that result from +false negatives, thereby correcting the gradients to approximate those of fully +supervised data. The implementation of DPL only requires a small modification +of the codes. Experimental studies on five public datasets validate the +effectiveness of proposed learning method.",cs.IR,['cs.IR'] +Overcoming Generic Knowledge Loss with Selective Parameter Update,Wenxuan Zhang · Paul Janson · Rahaf Aljundi · Mohamed Elhoseiny, ,https://arxiv.org/abs/2308.12462,,2308.12462.pdf,Overcoming Generic Knowledge Loss with Selective Parameter Update,"Foundation models encompass an extensive knowledge base and offer remarkable +transferability. However, this knowledge becomes outdated or insufficient over +time. The challenge lies in continuously updating foundation models to +accommodate novel information while retaining their original capabilities. +Leveraging the fact that foundation models have initial knowledge on various +tasks and domains, we propose a novel approach that, instead of updating all +parameters equally, localizes the updates to a sparse set of parameters +relevant to the task being learned. We strike a balance between efficiency and +new task performance, while maintaining the transferability and +generalizability of foundation models. We extensively evaluate our method on +foundational vision-language models with a diverse spectrum of continual +learning tasks. Our method achieves improvements on the accuracy of the newly +learned tasks up to 7% while preserving the pretraining knowledge with a +negligible decrease of 0.9% on a representative control set accuracy.",cs.CV,['cs.CV'] +Projecting Trackable Thermal Patterns for Dynamic Computer Vision,Mark Sheinin · Aswin C. Sankaranarayanan · Srinivasa G. 
Narasimhan, ,,https://www.globotreks.com/destinations/canada/day-trips-manitoba-winnipeg/,,,,,nan +DePT: Decoupled Prompt Tuning,Ji Zhang · Shihan Wu · Lianli Gao · Heng Tao Shen · Jingkuan Song, ,https://arxiv.org/abs/2309.07439,,2309.07439.pdf,DePT: Decoupled Prompt Tuning,"This work breaks through the Base-New Tradeoff (BNT)dilemma in prompt tuning, +i.e., the better the tuned model generalizes to the base (or target) task, the +worse it generalizes to new tasks, and vice versa. Specifically, through an +in-depth analysis of the learned features of the base and new tasks, we observe +that the BNT stems from a channel bias issue, i.e., the vast majority of +feature channels are occupied by base-specific knowledge, resulting in the +collapse of taskshared knowledge important to new tasks. To address this, we +propose the Decoupled Prompt Tuning (DePT) framework, which decouples +base-specific knowledge from feature channels into an isolated feature space +during prompt tuning, so as to maximally preserve task-shared knowledge in the +original feature space for achieving better zero-shot generalization on new +tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods, +hence it can improve all of them. Extensive experiments on 11 datasets show the +strong flexibility and effectiveness of DePT. Our code and pretrained models +are available at https://github.com/Koorye/DePT.",cs.CV,['cs.CV'] +Sharingan: A Transformer Architecture for Multi-Person Gaze Following,Samy Tafasca · Anshul Gupta · Jean-marc Odobez, ,https://arxiv.org/abs/2310.00816,,2310.00816.pdf,Sharingan: A Transformer-based Architecture for Gaze Following,"Gaze is a powerful form of non-verbal communication and social interaction +that humans develop from an early age. As such, modeling this behavior is an +important task that can benefit a broad set of application domains ranging from +robotics to sociology. In particular, Gaze Following is defined as the +prediction of the pixel-wise 2D location where a person in the image is +looking. Prior efforts in this direction have focused primarily on CNN-based +architectures to perform the task. In this paper, we introduce a novel +transformer-based architecture for 2D gaze prediction. We experiment with 2 +variants: the first one retains the same task formulation of predicting a gaze +heatmap for one person at a time, while the second one casts the problem as a +2D point regression and allows us to perform multi-person gaze prediction with +a single forward pass. This new architecture achieves state-of-the-art results +on the GazeFollow and VideoAttentionTarget datasets. The code for this paper +will be made publicly available.",cs.CV,['cs.CV'] +Fully Exploiting Every Real Sample: Super-Pixel Sample Gradient Model Stealing,Yunlong Zhao · Xiaoheng Deng · Yijing Liu · Xinjun Pei · Jiazhi Xia · Wei Chen, ,https://ar5iv.labs.arxiv.org/html/2309.10058,,2309.10058.pdf,Dual Student Networks for Data-Free Model Stealing,"Existing data-free model stealing methods use a generator to produce samples +in order to train a student model to match the target model outputs. To this +end, the two main challenges are estimating gradients of the target model +without access to its parameters, and generating a diverse set of training +samples that thoroughly explores the input space. We propose a Dual Student +method where two students are symmetrically trained in order to provide the +generator a criterion to generate samples that the two students disagree on. 
On +one hand, disagreement on a sample implies at least one student has classified +the sample incorrectly when compared to the target model. This incentive +towards disagreement implicitly encourages the generator to explore more +diverse regions of the input space. On the other hand, our method utilizes +gradients of student models to indirectly estimate gradients of the target +model. We show that this novel training objective for the generator network is +equivalent to optimizing a lower bound on the generator's loss if we had access +to the target model gradients. We show that our new optimization framework +provides more accurate gradient estimation of the target model and better +accuracies on benchmark classification datasets. Additionally, our approach +balances improved query efficiency with training computation cost. Finally, we +demonstrate that our method serves as a better proxy model for transfer-based +adversarial attacks than existing data-free model stealing methods.",cs.LG,"['cs.LG', 'cs.CR']" +MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation,Mi Yan · Jiazhao Zhang · Yan Zhu · He Wang,https://pku-epic.github.io/MaskClustering/,https://arxiv.org/abs/2401.07745,,2401.07745.pdf,MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation,"Open-vocabulary 3D instance segmentation is cutting-edge for its ability to +segment 3D instances without predefined categories. However, progress in 3D +lags behind its 2D counterpart due to limited annotated 3D data. To address +this, recent works first generate 2D open-vocabulary masks through 2D models +and then merge them into 3D instances based on metrics calculated between two +neighboring frames. In contrast to these local metrics, we propose a novel +metric, view consensus rate, to enhance the utilization of multi-view +observations. The key insight is that two 2D masks should be deemed part of the +same 3D instance if a significant number of other 2D masks from different views +contain both these two masks. Using this metric as edge weight, we construct a +global mask graph where each mask is a node. Through iterative clustering of +masks showing high view consensus, we generate a series of clusters, each +representing a distinct 3D instance. Notably, our model is training-free. +Through extensive experiments on publicly available datasets, including +ScanNet++, ScanNet200 and MatterPort3D, we demonstrate that our method achieves +state-of-the-art performance in open-vocabulary 3D instance segmentation. Our +project page is at https://pku-epic.github.io/MaskClustering.",cs.CV,['cs.CV'] +Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts,Jialin Wu · Xia Hu · Yaqing Wang · Bo Pang · Radu Soricut, ,https://arxiv.org/abs/2312.00968,,2312.00968.pdf,Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts,"Large multi-modal models (LMMs) exhibit remarkable performance across +numerous tasks. However, generalist LMMs often suffer from performance +degradation when tuned over a large collection of tasks. Recent research +suggests that Mixture of Experts (MoE) architectures are useful for instruction +tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost +of replicating and storing the expert models severely limits the number of +experts we can use. 
We propose Omni-SMoLA, an architecture that uses the Soft +MoE approach to (softly) mix many multimodal low rank experts, and avoids +introducing a significant number of new parameters compared to conventional MoE +models. The core intuition here is that the large model provides a foundational +backbone, while different lightweight experts residually learn specialized +knowledge, either per-modality or multimodally. Extensive experiments +demonstrate that the SMoLA approach helps improve the generalist performance +across a broad range of generative vision-and-language tasks, achieving new +SoTA generalist performance that often matches or outperforms single +specialized LMM baselines, as well as new SoTA specialist performance.",cs.CV,"['cs.CV', 'cs.CL']" +SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation,Thuan Nguyen · Anh Tran,thuanz123.github.io/swiftbrush,https://arxiv.org/abs/2312.05239,,2312.05239.pdf,SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation,"Despite their ability to generate high-resolution and diverse images from +text prompts, text-to-image diffusion models often suffer from slow iterative +sampling processes. Model distillation is one of the most effective directions +to accelerate these models. However, previous distillation methods fail to +retain the generation quality while requiring a significant amount of images +for training, either from real data or synthetically generated by the teacher +model. In response to this limitation, we present a novel image-free +distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from +text-to-3D synthesis, in which a 3D neural radiance field that aligns with the +input prompt can be obtained from a 2D text-to-image diffusion prior via a +specialized loss without the use of any 3D data ground-truth, our approach +re-purposes that same loss for distilling a pretrained multi-step text-to-image +model to a student network that can generate high-fidelity images with just a +single inference step. In spite of its simplicity, our model stands as one of +the first one-step text-to-image generators that can produce images of +comparable quality to Stable Diffusion without reliance on any training image +data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a +CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive +results or even substantially surpassing existing state-of-the-art distillation +techniques.",cs.CV,['cs.CV'] +HardMo: A Large-Scale Hardcase Dataset for Motion Capture,Jiaqi Liao · Chuanchen Luo · Yinuo Du · Yuxi Wang · Xu-Cheng Yin · Man Zhang · Zhaoxiang Zhang · Junran Peng, ,,,,,,,nan +Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models,Huan Ling · Seung Wook Kim · Antonio Torralba · Sanja Fidler · Karsten Kreis, ,https://arxiv.org/abs/2312.13763,,2312.13763.pdf,Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models,"Text-guided diffusion models have revolutionized image and video generation +and have also been successfully used for optimization-based 3D object +synthesis. Here, we instead focus on the underexplored text-to-4D setting and +synthesize dynamic, animated 3D objects using score distillation methods with +an additional temporal dimension. 
Compared to previous work, we pursue a novel +compositional generation-based approach, and combine text-to-image, +text-to-video, and 3D-aware multiview diffusion models to provide feedback +during 4D object optimization, thereby simultaneously enforcing temporal +consistency, high-quality visual appearance and realistic geometry. Our method, +called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with +deformation fields as 4D representation. Crucial to AYG is a novel method to +regularize the distribution of the moving 3D Gaussians and thereby stabilize +the optimization and induce motion. We also propose a motion amplification +mechanism as well as a new autoregressive synthesis scheme to generate and +combine multiple 4D sequences for longer generation. These techniques allow us +to synthesize vivid dynamic scenes, outperform previous work qualitatively and +quantitatively and achieve state-of-the-art text-to-4D performance. Due to the +Gaussian 4D representation, different 4D animations can be seamlessly combined, +as we demonstrate. AYG opens up promising avenues for animation, simulation and +digital content creation as well as synthetic data generation.",cs.CV,"['cs.CV', 'cs.LG']" +THRONE: A Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models,Prannay Kaul · Zhizhong Li · Hao Yang · Yonatan Dukler · Ashwin Swaminathan · CJ Taylor · Stefano Soatto, ,,,,,,,nan +CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs,Yingji Zhong · Lanqing Hong · Zhenguo Li · Dan Xu, ,,,,,,,nan +Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction,Xiaoyang Lyu · Chirui Chang · Peng Dai · Yangtian Sun · Xiaojuan Qi, ,https://arxiv.org/abs/2403.19314,,2403.19314.pdf,Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction,"Scene reconstruction from multi-view images is a fundamental problem in +computer vision and graphics. Recent neural implicit surface reconstruction +methods have achieved high-quality results; however, editing and manipulating +the 3D geometry of reconstructed scenes remains challenging due to the absence +of naturally decomposed object entities and complex object/background +compositions. In this paper, we present Total-Decom, a novel method for +decomposed 3D reconstruction with minimal human interaction. Our approach +seamlessly integrates the Segment Anything Model (SAM) with hybrid +implicit-explicit neural surface representations and a mesh-based +region-growing technique for accurate 3D object decomposition. Total-Decom +requires minimal human annotations while providing users with real-time control +over the granularity and quality of decomposition. We extensively evaluate our +method on benchmark datasets and demonstrate its potential for downstream +applications, such as animation and scene editing.
The code is available at +https://github.com/CVMI-Lab/Total-Decom.git.",cs.CV,['cs.CV'] +Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment,Ziyu Shan · Yujie Zhang · Qi Yang · Haichen Yang · Yiling Xu · Jenq-Neng Hwang · Xiaozhong Xu · Shan Liu, ,https://arxiv.org/abs/2403.10066,,2403.10066.pdf,Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment,"No-reference point cloud quality assessment (NR-PCQA) aims to automatically +evaluate the perceptual quality of distorted point clouds without available +reference, which have achieved tremendous improvements due to the utilization +of deep neural networks. However, learning-based NR-PCQA methods suffer from +the scarcity of labeled data and usually perform suboptimally in terms of +generalization. To solve the problem, we propose a novel contrastive +pre-training framework tailored for PCQA (CoPA), which enables the pre-trained +model to learn quality-aware representations from unlabeled data. To obtain +anchors in the representation space, we project point clouds with different +distortions into images and randomly mix their local patches to form mixed +images with multiple distortions. Utilizing the generated anchors, we constrain +the pre-training process via a quality-aware contrastive loss following the +philosophy that perceptual quality is closely related to both content and +distortion. Furthermore, in the model fine-tuning stage, we propose a +semantic-guided multi-view fusion module to effectively integrate the features +of projected images from multiple perspectives. Extensive experiments show that +our method outperforms the state-of-the-art PCQA methods on popular benchmarks. +Further investigations demonstrate that CoPA can also benefit existing +learning-based PCQA models.",cs.CV,"['cs.CV', 'cs.MM']" +PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding,Xuesong Nie · Haoyuan Jin · Yunfeng Yan · Xi Chen · Zhihang Zhu · Donglian Qi, ,http://export.arxiv.org/abs/2310.18698,,2310.18698.pdf,Triplet Attention Transformer for Spatiotemporal Predictive Learning,"Spatiotemporal predictive learning offers a self-supervised learning paradigm +that enables models to learn both spatial and temporal patterns by predicting +future sequences based on historical sequences. Mainstream methods are +dominated by recurrent units, yet they are limited by their lack of +parallelization and often underperform in real-world scenarios. To improve +prediction quality while maintaining computational efficiency, we propose an +innovative triplet attention transformer designed to capture both inter-frame +dynamics and intra-frame static features. Specifically, the model incorporates +the Triplet Attention Module (TAM), which replaces traditional recurrent units +by exploring self-attention mechanisms in temporal, spatial, and channel +dimensions. In this configuration: (i) temporal tokens contain abstract +representations of inter-frame, facilitating the capture of inherent temporal +dependencies; (ii) spatial and channel attention combine to refine the +intra-frame representation by performing fine-grained interactions across +spatial and channel dimensions. Alternating temporal, spatial, and +channel-level attention allows our approach to learn more complex short- and +long-range spatiotemporal dependencies. 
Extensive experiments demonstrate +performance surpassing existing recurrent-based and recurrent-free methods, +achieving state-of-the-art under multi-scenario examination including moving +object trajectory prediction, traffic flow prediction, driving scene +prediction, and human motion capture.",cs.CV,"['cs.CV', 'cs.LG']" +U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation,You Wu · Kean Liu · Xiaoyue Mi · Fan Tang · Juan Cao · Jintao Li, ,https://arxiv.org/abs/2403.20231,,2403.20231.pdf,U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation,"Concept personalization methods enable large text-to-image models to learn +specific subjects (e.g., objects/poses/3D models) and synthesize renditions in +new contexts. Given that the image references are highly biased towards visual +attributes, state-of-the-art personalization models tend to overfit the whole +subject and cannot disentangle visual characteristics in pixel space. In this +study, we proposed a more challenging setting, namely fine-grained visual +appearance personalization. Different from existing methods, we allow users to +provide a sentence describing the desired attributes. A novel decoupled +self-augmentation strategy is proposed to generate target-related and +non-target samples to learn user-specified visual attributes. These augmented +data allow for refining the model's understanding of the target attribute while +mitigating the impact of unrelated attributes. At the inference stage, +adjustments are conducted on semantic space through the learned target and +non-target embeddings to further enhance the disentanglement of target +attributes. Extensive experiments on various kinds of visual attributes with +SOTA personalization methods show the ability of the proposed method to mimic +target visual appearance in novel contexts, thus improving the controllability +and flexibility of personalization.",cs.CV,['cs.CV'] +OVMR: Open-Vocabulary Recognition with Multi-Modal References,Zehong Ma · Shiliang Zhang · Longhui Wei · Qi Tian, ,https://arxiv.org/abs/2306.05493,,2306.05493.pdf,Multi-Modal Classifiers for Open-Vocabulary Object Detection,"The goal of this paper is open-vocabulary object detection (OVOD) +$\unicode{x2013}$ building a model that can detect objects beyond the set of +categories seen at training, thus enabling the user to specify categories of +interest at inference without the need for model retraining. We adopt a +standard two-stage object detector architecture, and explore three ways for +specifying novel categories: via language descriptions, via image exemplars, or +via a combination of the two. We make three contributions: first, we prompt a +large language model (LLM) to generate informative language descriptions for +object classes, and construct powerful text-based classifiers; second, we +employ a visual aggregator on image exemplars that can ingest any number of +images as input, forming vision-based classifiers; and third, we provide a +simple method to fuse information from language descriptions and image +exemplars, yielding a multi-modal classifier. 
When evaluating on the +challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our +text-based classifiers outperform all previous OVOD works; (ii) our +vision-based classifiers perform as well as text-based classifiers in prior +work; (iii) using multi-modal classifiers perform better than either modality +alone; and finally, (iv) our text-based and multi-modal classifiers yield +better performance than a fully-supervised detector.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.6; I.4.8; I.4.9; I.2.10']" +Dynamic Prompt Optimizing for Text-to-Image Generation,Wenyi Mo · Tianyu Zhang · Yalong Bai · Bing Su · Ji-Rong Wen · Qing Yang, ,https://arxiv.org/abs/2404.04095,,2404.04095.pdf,Dynamic Prompt Optimizing for Text-to-Image Generation,"Text-to-image generative models, specifically those based on diffusion models +like Imagen and Stable Diffusion, have made substantial advancements. Recently, +there has been a surge of interest in the delicate refinement of text prompts. +Users assign weights or alter the injection time steps of certain words in the +text prompts to improve the quality of generated images. However, the success +of fine-control prompts depends on the accuracy of the text prompts and the +careful selection of weights and time steps, which requires significant manual +intervention. To address this, we introduce the \textbf{P}rompt +\textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original +prompts for image generation, we further employ an online reinforcement +learning strategy to explore the weights and injection time steps of each word, +leading to the dynamic fine-control prompts. The reward function during +training encourages the model to consider aesthetic score, semantic +consistency, and user preferences. Experimental results demonstrate that our +proposed method effectively improves the original prompts, generating visually +more appealing images while maintaining semantic alignment. Code is available +at https://github.com/Mowenyii/PAE.",cs.CV,"['cs.CV', 'cs.AI']" +DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery,Yixuan Zhu · Ao Li · Yansong Tang · Wenliang Zhao · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2404.01424,,2404.01424.pdf,DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery,"The recovery of occluded human meshes presents challenges for current methods +due to the difficulty in extracting effective image features under severe +occlusion. In this paper, we introduce DPMesh, an innovative framework for +occluded human mesh recovery that capitalizes on the profound diffusion prior +about object structure and spatial relationships embedded in a pre-trained +text-to-image diffusion model. Unlike previous methods reliant on conventional +backbones for vanilla feature extraction, DPMesh seamlessly integrates the +pre-trained denoising U-Net with potent knowledge as its image backbone and +performs a single-step inference to provide occlusion-aware information. To +enhance the perception capability for occluded poses, DPMesh incorporates +well-designed guidance via condition injection, which produces effective +controls from 2D observations for the denoising U-Net. Furthermore, we explore +a dedicated noisy key-point reasoning approach to mitigate disturbances arising +from occlusion and crowded scenarios. This strategy fully unleashes the +perceptual capability of the diffusion prior, thereby enhancing accuracy. 
+Extensive experiments affirm the efficacy of our framework, as we outperform +state-of-the-art methods on both occlusion-specific and standard datasets. The +persuasive results underscore its ability to achieve precise and robust 3D +human mesh recovery, particularly in challenging scenarios involving occlusion +and crowded scenes.",cs.CV,['cs.CV'] +Learning Inclusion Matching for Animation Paint Bucket Colorization,Yuekun Dai · Shangchen Zhou · Blake Li · Chongyi Li · Chen Change Loy,https://ykdai.github.io/projects/InclusionMatching,https://arxiv.org/abs/2403.18342,,2403.18342.pdf,Learning Inclusion Matching for Animation Paint Bucket Colorization,"Colorizing line art is a pivotal task in the production of hand-drawn cel +animation. This typically involves digital painters using a paint bucket tool +to manually color each segment enclosed by lines, based on RGB values +predetermined by a color designer. This frame-by-frame process is both arduous +and time-intensive. Current automated methods mainly focus on segment matching. +This technique migrates colors from a reference to the target frame by aligning +features within line-enclosed segments across frames. However, issues like +occlusion and wrinkles in animations often disrupt these direct +correspondences, leading to mismatches. In this work, we introduce a new +learning-based inclusion matching pipeline, which directs the network to +comprehend the inclusion relationships between segments rather than relying +solely on direct visual correspondences. Our method features a two-stage +pipeline that integrates a coarse color warping module with an inclusion +matching module, enabling more nuanced and accurate colorization. To facilitate +the training of our network, we also develope a unique dataset, referred to as +PaintBucket-Character. This dataset includes rendered line arts alongside their +colorized counterparts, featuring various 3D characters. Extensive experiments +demonstrate the effectiveness and superiority of our method over existing +techniques.",cs.CV,['cs.CV'] +Grounded Question-Answering in Long Egocentric Videos,Shangzhe Di · Weidi Xie,https://github.com/Becomebright/GroundVQA,https://arxiv.org/abs/2312.06505,,2312.06505.pdf,Grounded Question-Answering in Long Egocentric Videos,"Existing approaches to video understanding, mainly designed for short videos +from a third-person perspective, are limited in their applicability in certain +fields, such as robotics. In this paper, we delve into open-ended +question-answering (QA) in long, egocentric videos, which allows individuals or +robots to inquire about their own past visual experiences. This task presents +unique challenges, including the complexity of temporally grounding queries +within extensive video content, the high resource demands for precise data +annotation, and the inherent difficulty of evaluating open-ended answers due to +their ambiguous nature. Our proposed approach tackles these challenges by (i) +integrating query grounding and answering within a unified model to reduce +error propagation; (ii) employing large language models for efficient and +scalable data synthesis; and (iii) introducing a close-ended QA task for +evaluation, to manage answer ambiguity. Extensive experiments demonstrate the +effectiveness of our method, which also achieves state-of-the-art performance +on the QaEgo4D and Ego4D-NLQ benchmarks. 
Code, data, and models are available +at https://github.com/Becomebright/GroundVQA.",cs.CV,['cs.CV'] +SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models,Feifei Wang · Zhentao Tan · Tianyi Wei · Yue Wu · Qidong Huang, ,https://arxiv.org/abs/2312.07865,,2312.07865.pdf,SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models,"Despite the success of diffusion-based customization methods on visual +content creation, increasing concerns have been raised about such techniques +from both privacy and political perspectives. To tackle this issue, several +anti-customization methods have been proposed in very recent months, +predominantly grounded in adversarial attacks. Unfortunately, most of these +methods adopt straightforward designs, such as end-to-end optimization with a +focus on adversarially maximizing the original training loss, thereby +neglecting nuanced internal properties intrinsic to the diffusion model, and +even leading to ineffective optimization in some diffusion time steps.In this +paper, we strive to bridge this gap by undertaking a comprehensive exploration +of these inherent properties, to boost the performance of current +anti-customization approaches. Two aspects of properties are investigated: 1) +We examine the relationship between time step selection and the model's +perception in the frequency domain of images and find that lower time steps can +give much more contributions to adversarial noises. This inspires us to propose +an adaptive greedy search for optimal time steps that seamlessly integrates +with existing anti-customization methods. 2) We scrutinize the roles of +features at different layers during denoising and devise a sophisticated +feature-based optimization framework for anti-customization.Experiments on +facial benchmarks demonstrate that our approach significantly increases +identity disruption, thereby protecting user privacy and copyright. Our code is +available at: https://github.com/somuchtome/SimAC.",cs.CV,['cs.CV'] +DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning,Yuhang He · YingJie Chen · Yuhan Jin · Songlin Dong · Xing Wei · Yihong Gong, ,https://arxiv.org/abs/2405.08533,,2405.08533.pdf,Dynamic Feature Learning and Matching for Class-Incremental Learning,"Class-incremental learning (CIL) has emerged as a means to learn new classes +incrementally without catastrophic forgetting of previous classes. Recently, +CIL has undergone a paradigm shift towards dynamic architectures due to their +superior performance. However, these models are still limited by the following +aspects: (i) Data augmentation (DA), which are tightly coupled with CIL, +remains under-explored in dynamic architecture scenarios. (ii) Feature +representation. The discriminativeness of dynamic feature are sub-optimal and +possess potential for refinement. (iii) Classifier. The misalignment between +dynamic feature and classifier constrains the capabilities of the model. To +tackle the aforementioned drawbacks, we propose the Dynamic Feature Learning +and Matching (DFLM) model in this paper from above three perspectives. +Specifically, we firstly introduce class weight information and non-stationary +functions to extend the mix DA method for dynamically adjusting the focus on +memory during training. 
Then, von Mises-Fisher (vMF) classifier is employed to +effectively model the dynamic feature distribution and implicitly learn their +discriminative properties. Finally, the matching loss is proposed to facilitate +the alignment between the learned dynamic features and the classifier by +minimizing the distribution distance. Extensive experiments on CIL benchmarks +validate that our proposed model achieves significant performance improvements +over existing methods.",cs.CV,['cs.CV'] +NightCC: Nighttime Color Constancy via Adaptive Channel Masking,Shuwei Li · Robby T. Tan, ,,,,,,,nan +G$^3$-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding,Yuan Wang · Yali Li · Shengjin Wang, ,https://arxiv.org/abs/2403.08182,,2403.08182.pdf,SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention,"3D visual grounding aims to automatically locate the 3D region of the +specified object given the corresponding textual description. Existing works +fail to distinguish similar objects especially when multiple referred objects +are involved in the description. Experiments show that direct matching of +language and visual modal has limited capacity to comprehend complex +referential relationships in utterances. It is mainly due to the interference +caused by redundant visual information in cross-modal alignment. To strengthen +relation-orientated mapping between different modalities, we propose SeCG, a +semantic-enhanced relational learning model based on a graph network with our +designed memory graph attention layer. Our method replaces original +language-independent encoding with cross-modal encoding in visual analysis. +More text-related feature expressions are obtained through the guidance of +global semantics and implicit relationships. Experimental results on ReferIt3D +and ScanRefer benchmarks show that the proposed method outperforms the existing +state-of-the-art methods, particularly improving the localization performance +for the multi-relation challenges.",cs.CV,['cs.CV'] +Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation,Yunhe Gao, ,https://arxiv.org/abs/2306.02416,,2306.02416.pdf,Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation,"A major focus of clinical imaging workflow is disease diagnosis and +management, leading to medical imaging datasets strongly tied to specific +clinical objectives. This scenario has led to the prevailing practice of +developing task-specific segmentation models, without gaining insights from +widespread imaging cohorts. Inspired by the training program of medical +radiology residents, we propose a shift towards universal medical image +segmentation, a paradigm aiming to build medical image understanding foundation +models by leveraging the diversity and commonality across clinical targets, +body regions, and imaging modalities. Towards this goal, we develop Hermes, a +novel context-prior learning approach to address the challenges of data +heterogeneity and annotation differences in medical image segmentation. In a +large collection of eleven diverse datasets (2,438 3D images) across five +modalities (CT, PET, T1, T2 and cine MRI) and multiple body regions, we +demonstrate the merit of the universal paradigm over the traditional paradigm +on addressing multiple tasks within a single model. 
By exploiting the synergy +across tasks, Hermes achieves state-of-the-art performance on all testing +datasets and shows superior model scalability. Results on two additional +datasets reveals Hermes' strong performance for transfer learning, incremental +learning, and generalization to downstream tasks. Hermes's learned priors +demonstrate an appealing trait to reflect the intricate relations among tasks +and modalities, which aligns with the established anatomical and imaging +principles in radiology. The code is available: +https://github.com/yhygao/universal-medical-image-segmentation.",cs.CV,['cs.CV'] +Generative Quanta Color Imaging,Vishal Purohit · Junjie Luo · Yiheng Chi · Qi Guo · Stanley H. Chan · Qiang Qiu, ,https://arxiv.org/abs/2403.19066,,2403.19066.pdf,Generative Quanta Color Imaging,"The astonishing development of single-photon cameras has created an +unprecedented opportunity for scientific and industrial imaging. However, the +high data throughput generated by these 1-bit sensors creates a significant +bottleneck for low-power applications. In this paper, we explore the +possibility of generating a color image from a single binary frame of a +single-photon camera. We evidently find this problem being particularly +difficult to standard colorization approaches due to the substantial degree of +exposure variation. The core innovation of our paper is an exposure synthesis +model framed under a neural ordinary differential equation (Neural ODE) that +allows us to generate a continuum of exposures from a single observation. This +innovation ensures consistent exposure in binary images that colorizers take +on, resulting in notably enhanced colorization. We demonstrate applications of +the method in single-image and burst colorization and show superior generative +performance over baselines. Project website can be found at +https://vishal-s-p.github.io/projects/2023/generative_quanta_color.html.",cs.CV,"['cs.CV', 'cs.AI']" +Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts,Dominik Scheuble · Chenyang Lei · Mario Bijelic · Seung-Hwan Baek · Felix Heide, ,,https://cg.postech.ac.kr/2024/03/01/9-papers-are-accepted-to-cvpr-2024/,,,,,nan +MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models by Mirage Projection,Haowen Sun · Yueqi Duan · Juncheng Yan · Yifan Liu · Jiwen Lu, ,https://arxiv.org/abs/2403.06403,,2403.06403.pdf,PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models,"Recent success of vision foundation models have shown promising performance +for the 2D perception tasks. However, it is difficult to train a 3D foundation +network directly due to the limited dataset and it remains under explored +whether existing foundation models can be lifted to 3D space seamlessly. In +this paper, we present PointSeg, a novel training-free paradigm that leverages +off-the-shelf vision foundation models to address 3D scene perception tasks. +PointSeg can segment anything in 3D scene by acquiring accurate 3D prompts to +align their corresponding pixels across frames. Concretely, we design a +two-branch prompts learning structure to construct the 3D point-box prompts +pairs, combining with the bidirectional matching strategy for accurate point +and proposal prompts generation. Then, we perform the iterative post-refinement +adaptively when cooperated with different vision foundation models. Moreover, +we design a affinity-aware merging algorithm to improve the final ensemble +masks. 
PointSeg demonstrates impressive segmentation performance across various +datasets, all without training. Specifically, our approach significantly +surpasses the state-of-the-art specialist model by 13.4$\%$, 11.3$\%$, and +12$\%$ mAP on ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top +of that, PointSeg can incorporate with various segmentation models and even +surpasses the supervised methods.",cs.CV,['cs.CV'] +Overcoming Data Limitations for High-Quality Video Diffusion Models,Haoxin Chen · Yong Zhang · Xiaodong Cun · Menghan Xia · Xintao Wang · CHAO WENG · Ying Shan, ,,,,,,,nan +FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion,George Cazenavette · Avneesh Sud · Thomas Leung · Ben Usman, ,https://ar5iv.labs.arxiv.org/html/2210.06998,,2210.06998.pdf,DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models,"Text-to-image generation models that generate images based on prompt +descriptions have attracted an increasing amount of attention during the past +few months. Despite their encouraging performance, these models raise concerns +about the misuse of their generated fake images. To tackle this problem, we +pioneer a systematic study on the detection and attribution of fake images +generated by text-to-image generation models. Concretely, we first build a +machine learning classifier to detect the fake images generated by various +text-to-image generation models. We then attribute these fake images to their +source models, such that model owners can be held responsible for their models' +misuse. We further investigate how prompts that generate fake images affect +detection and attribution. We conduct extensive experiments on four popular +text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion, +GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical +results show that (1) fake images generated by various models can be +distinguished from real ones, as there exists a common artifact shared by fake +images from different models; (2) fake images can be effectively attributed to +their source models, as different models leave unique fingerprints in their +generated images; (3) prompts with the ``person'' topic or a length between 25 +and 75 enable models to generate fake images with higher authenticity. All +findings contribute to the community's insight into the threats caused by +text-to-image generation models. We appeal to the community's consideration of +the counterpart solutions, like ours, against the rapidly-evolving fake image +generation.",cs.CR,"['cs.CR', 'cs.CV', 'cs.LG']" +"Separating the ""Chirp"" from the ""Chat"": Self-supervised Visual Grounding of Sound and Language",Mark Hamilton · Andrew Zisserman · John Hershey · William Freeman, ,https://arxiv.org/abs/2404.19696,,,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. 
Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models,Weiwei Cao · Jianpeng Zhang · Yingda Xia · Tony C. W. MOK · Zi Li · Xianghua Ye · Le Lu · Jian Zheng · Yuxing Tang · Ling Zhang, ,https://arxiv.org/abs/2404.04936,,2404.04936.pdf,Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models,"Radiologists highly desire fully automated versatile AI for medical imaging +interpretation. However, the lack of extensively annotated large-scale +multi-disease datasets has hindered the achievement of this goal. In this +paper, we explore the feasibility of leveraging language as a naturally +high-quality supervision for chest CT imaging. In light of the limited +availability of image-report pairs, we bootstrap the understanding of 3D chest +CT images by distilling chest-related diagnostic knowledge from an extensively +pre-trained 2D X-ray expert model. Specifically, we propose a language-guided +retrieval method to match each 3D CT image with its semantically closest 2D +X-ray image, and perform pair-wise and semantic relation knowledge +distillation. Subsequently, we use contrastive learning to align images and +reports within the same patient while distinguishing them from the other +patients. However, the challenge arises when patients have similar semantic +diagnoses, such as healthy patients, potentially confusing if treated as +negatives. We introduce a robust contrastive learning that identifies and +corrects these false negatives. We train our model with over 12,000 pairs of +chest CT images and radiology reports. Extensive experiments across multiple +scenarios, including zero-shot learning, report generation, and fine-tuning +processes, demonstrate the model's feasibility in interpreting chest CT images.",cs.CV,['cs.CV'] +Towards Automated Movie Trailer Generation,Dawit Argaw Argaw · Mattia Soldan · Alejandro Pardo · Chen Zhao · Fabian Caba Heilbron · Joon Chung · Bernard Ghanem, ,https://arxiv.org/abs/2404.03477,,2404.03477.pdf,Towards Automated Movie Trailer Generation,"Movie trailers are an essential tool for promoting films and attracting +audiences. However, the process of creating trailers can be time-consuming and +expensive. To streamline this process, we propose an automatic trailer +generation framework that generates plausible trailers from a full movie by +automating shot selection and composition. Our approach draws inspiration from +machine translation techniques and models the movies and trailers as sequences +of shots, thus formulating the trailer generation problem as a +sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT), a +deep-learning framework utilizing an encoder-decoder architecture. 
TGT movie +encoder is tasked with contextualizing each movie shot representation via +self-attention, while the autoregressive trailer decoder predicts the feature +representation of the next trailer shot, accounting for the relevance of shots' +temporal order in trailers. Our TGT significantly outperforms previous methods +on a comprehensive suite of metrics.",cs.CV,['cs.CV'] +COCONut: Modernizing COCO Segmentation,Xueqing Deng · Qihang Yu · Peng Wang · Xiaohui Shen · Liang-Chieh Chen, ,,,,,,,nan +Investigating Compositional Challenges in Vision-Language Models for Visual Grounding,Yunan Zeng · Yan Huang · Jinjin Zhang · Zequn Jie · Zhenhua Chai · Liang Wang, ,https://arxiv.org/html/2405.17104v1,,2405.17104v1.pdf,LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding,"Visual grounding is an essential tool that links user-provided text queries +with query-specific regions within an image. Despite advancements in visual +grounding models, their ability to comprehend complex queries remains limited. +To overcome this limitation, we introduce LLM-Optic, an innovative method that +utilizes Large Language Models (LLMs) as an optical lens to enhance existing +visual grounding models in comprehending complex text queries involving +intricate text structures, multiple objects, or object spatial relationships, +situations that current models struggle with. LLM-Optic first employs an LLM as +a Text Grounder to interpret complex text queries and accurately identify +objects the user intends to locate. Then a pre-trained visual grounding model +is used to generate candidate bounding boxes given the refined query by the +Text Grounder. After that, LLM-Optic annotates the candidate bounding boxes +with numerical marks to establish a connection between text and specific image +regions, thereby linking two distinct modalities. Finally, it employs a Large +Multimodal Model (LMM) as a Visual Grounder to select the marked candidate +objects that best correspond to the original text query. Through LLM-Optic, we +have achieved universal visual grounding, which allows for the detection of +arbitrary objects specified by arbitrary human language input. Importantly, our +method achieves this enhancement without requiring additional training or +fine-tuning. Extensive experiments across various challenging benchmarks +demonstrate that LLM-Optic achieves state-of-the-art zero-shot visual grounding +capabilities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network,Quan Zhang · Lei Wang · Vishal M. Patel · Xiaohua Xie · Jianhuang Lai, ,https://arxiv.org/abs/2403.14513,,2403.14513.pdf,View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network,"Existing person re-identification methods have achieved remarkable advances +in appearance-based identity association across homogeneous cameras, such as +ground-ground matching. However, as a more practical scenario, aerial-ground +person re-identification (AGPReID) among heterogeneous cameras has received +minimal attention. To alleviate the disruption of discriminative identity +representation by dramatic view discrepancy as the most significant challenge +in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet +effective framework. 
Two major components are designed in VDT to decouple +view-related and view-unrelated features, namely hierarchical subtractive +separation and orthogonal loss, where the former separates these two features +inside the VDT, and the latter constrains these two to be independent. In +addition, we contribute a large-scale AGPReID dataset called CARGO, consisting +of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. +Experiments on two datasets show that VDT is a feasible and effective solution +for AGPReID, surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on +CARGO and 3.7%/5.2% on AG-ReID, keeping the same magnitude of computational +complexity. Our project is available at https://github.com/LinlyAC/VDT-AGPReID",cs.CV,['cs.CV'] +Towards Accurate Post-training Quantization for Diffusion Models,Changyuan Wang · Ziwei Wang · Xiuwei Xu · Yansong Tang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2404.05662,,2404.05662.pdf,Towards Accurate Binarization of Diffusion Model,"With the advancement of diffusion models (DMs) and the substantially +increased computational requirements, quantization emerges as a practical +solution to obtain compact and efficient low-bit DMs. However, the highly +discrete representation leads to severe accuracy degradation, hindering the +quantization of diffusion models to ultra-low bit-widths. This paper proposes a +novel quantization-aware training approach for DMs, namely BinaryDM. The +proposed method pushes DMs' weights toward accurate and efficient binarization, +considering the representation and computation properties. From the +representation perspective, we present a Learnable Multi-basis Binarizer (LMB) +to recover the representations generated by the binarized DM. The LMB enhances +detailed information through the flexible combination of dual binary bases +while applying to parameter-sparse locations of DM architectures to achieve +minor burdens. From the optimization perspective, a Low-rank Representation +Mimicking (LRM) is applied to assist the optimization of binarized DMs. The LRM +mimics the representations of full-precision DMs in low-rank space, alleviating +the direction ambiguity of the optimization process caused by fine-grained +alignment. Moreover, a quick progressive warm-up is applied to BinaryDM, +avoiding convergence difficulties by layerwisely progressive quantization at +the beginning of training. Comprehensive experiments demonstrate that BinaryDM +achieves significant accuracy and efficiency gains compared to SOTA +quantization methods of DMs under ultra-low bit-widths. With 1.1-bit weight and +4-bit activation (W1.1A4), BinaryDM achieves as low as 7.11 FID and saves the +performance from collapse (baseline FID 39.69). 
As the first binarization +method for diffusion models, W1.1A4 BinaryDM achieves impressive 9.3 times OPs +and 24.8 times model size savings, showcasing its substantial potential for +edge deployment.",cs.CV,['cs.CV'] +Density-Adaptive Model Based on Motif Matrix for Multi-Agent Trajectory Prediction,Di Wen · Haoran Xu · Zhaocheng He · Zhe Wu · Guang Tan · Peixi Peng, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/itr2.12502,,,,,nan +Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement,Han Wu · Guanyan Ou · Weibin Wu · Zibin Zheng, ,https://arxiv.org/abs/2312.04913,,2312.04913.pdf,SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation,"Current Visual-Language Pre-training (VLP) models are vulnerable to +adversarial examples. These adversarial examples present substantial security +risks to VLP models, as they can leverage inherent weaknesses in the models, +resulting in incorrect predictions. In contrast to white-box adversarial +attacks, transfer attacks (where the adversary crafts adversarial examples on a +white-box model to fool another black-box model) are more reflective of +real-world scenarios, thus making them more meaningful for research. By +summarizing and analyzing existing research, we identified two factors that can +influence the efficacy of transfer attacks on VLP models: inter-modal +interaction and data diversity. Based on these insights, we propose a +self-augment-based transfer attack method, termed SA-Attack. Specifically, +during the generation of adversarial images and adversarial texts, we apply +different data augmentation methods to the image modality and text modality, +respectively, with the aim of improving the adversarial transferability of the +generated adversarial images and texts. Experiments conducted on the FLickr30K +and COCO datasets have validated the effectiveness of our method. Our code will +be available after this paper is accepted.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.LG']" +Disentangled Prompt Representation for Domain Generalization,De Cheng · Zhipeng Xu · XINYANG JIANG · Nannan Wang · Dongsheng Li · Xinbo Gao, ,https://arxiv.org/abs/2403.08506,,,DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,"Federated learning (FL) has emerged as a powerful paradigm for learning from +decentralized data, and federated domain generalization further considers the +test dataset (target domain) is absent from the decentralized training data +(source domains). However, most existing FL methods assume that domain labels +are provided during training, and their evaluation imposes explicit constraints +on the number of domains, which must strictly match the number of clients. +Because of the underutilization of numerous edge devices and additional +cross-client domain annotations in the real world, such restrictions may be +impractical and involve potential privacy leaks. In this paper, we propose an +efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a +method that tackles the above restrictions by learning adaptive prompts for +domain generalization in a distributed manner. Specifically, we first design +two types of prompts, i.e., global prompt to capture general knowledge across +all clients and domain prompts to capture domain-specific knowledge. They +eliminate the restriction on the one-to-one mapping between source domains and +local clients. 
Furthermore, a dynamic query metric is introduced to +automatically search the suitable domain label for each sample, which includes +two-substep text-image alignments based on prompt tuning without +labor-intensive annotation. Extensive experiments on multiple datasets +demonstrate that our DiPrompT achieves superior domain generalization +performance over state-of-the-art FL methods when domain labels are not +provided, and even outperforms many centralized learning methods using domain +labels.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner,Mengfei Xia · Yujun Shen · Changsong Lei · Yu Zhou · Deli Zhao · Ran Yi · Wenping Wang · Yong-Jin Liu, ,https://arxiv.org/abs/2310.09469,,2310.09469.pdf,Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner,"A diffusion model, which is formulated to produce an image using thousands of +denoising steps, usually suffers from a slow inference speed. Existing +acceleration algorithms simplify the sampling by skipping most steps yet +exhibit considerable performance degradation. By viewing the generation of +diffusion models as a discretized integrating process, we argue that the +quality drop is partly caused by applying an inaccurate integral direction to a +timestep interval. To rectify this issue, we propose a timestep aligner that +helps find a more accurate integral direction for a particular interval at the +minimum cost. Specifically, at each denoising step, we replace the original +parameterization by conditioning the network on a new timestep, which is +obtained by aligning the sampling distribution to the real distribution. +Extensive experiments show that our plug-in design can be trained efficiently +and boost the inference performance of various state-of-the-art acceleration +methods, especially when there are few denoising steps. For example, when using +10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of +DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate +set of timesteps. Code will be made publicly available.",cs.CV,['cs.CV'] +AutoAD III: The Prequel -- Back to the Pixels,Tengda Han · Max Bain · Arsha Nagrani · Gül Varol · Weidi Xie · Andrew Zisserman, ,https://arxiv.org/abs/2404.14412v1,,2404.14412v1.pdf,AutoAD III: The Prequel -- Back to the Pixels,"Generating Audio Description (AD) for movies is a challenging task that +requires fine-grained visual understanding and an awareness of the characters +and their names. Currently, visual language models for AD generation are +limited by a lack of suitable training data, and also their evaluation is +hampered by using performance measures not specialized to the AD domain. In +this paper, we make three contributions: (i) We propose two approaches for +constructing AD datasets with aligned video data, and build training and +evaluation datasets using these. These datasets will be publicly released; (ii) +We develop a Q-former-based architecture which ingests raw video and generates +AD, using frozen pre-trained visual encoders and large language models; and +(iii) We provide new evaluation metrics to benchmark AD quality that are +well-matched to human performance. 
Taken together, we improve the state of the +art on AD generation.",cs.CV,['cs.CV'] +Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting,Taeho Kang · Youngki Lee,https://tho-kn.github.io/projects/EgoTAP/,https://arxiv.org/abs/2402.18330,,2402.18330.pdf,Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting,"We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate +stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view +limbs in egocentric camera views make accurate pose estimation a challenging +problem. To address the challenge, prior methods employ joint +heatmaps-probabilistic 2D representations of the body pose, but heatmap-to-3D +pose conversion still remains an inaccurate process. We propose a novel +heatmap-to-3D lifting method composed of the Grid ViT Encoder and the +Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into +effective feature embedding using self-attention. Then, the Propagation Network +estimates the 3D pose by utilizing skeletal information to better estimate the +position of obscure joints. Our method significantly outperforms the previous +state-of-the-art qualitatively and quantitatively demonstrated by a 23.9\% +reduction of error in an MPJPE metric. Our source code is available in GitHub.",cs.CV,['cs.CV'] +Data Valuation and Detections in Federated Learning,Wenqian Li · Shuran Fu · Fengrui Zhang · Yan Pang,https://github.com/muz1lee/MOTdata/tree/main,https://arxiv.org/abs/2311.05304v2,,2311.05304v2.pdf,Data Valuation and Detections in Federated Learning,"Federated Learning (FL) enables collaborative model training while preserving +the privacy of raw data. A challenge in this framework is the fair and +efficient valuation of data, which is crucial for incentivizing clients to +contribute high-quality data in the FL task. In scenarios involving numerous +data clients within FL, it is often the case that only a subset of clients and +datasets are pertinent to a specific learning task, while others might have +either a negative or negligible impact on the model training process. This +paper introduces a novel privacy-preserving method for evaluating client +contributions and selecting relevant datasets without a pre-specified training +algorithm in an FL task. Our proposed approach FedBary, utilizes Wasserstein +distance within the federated context, offering a new solution for data +valuation in the FL framework. This method ensures transparent data valuation +and efficient computation of the Wasserstein barycenter and reduces the +dependence on validation datasets. Through extensive empirical experiments and +theoretical analyses, we demonstrate the potential of this data valuation +method as a promising avenue for FL research.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion,Khiem Vuong · N. Dinesh Reddy · Robert Tamburo · Srinivasa G. Narasimhan, ,https://arxiv.org/abs/2403.19022,,2403.19022.pdf,WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion,"Current methods for 2D and 3D object understanding struggle with severe +occlusions in busy urban environments, partly due to the lack of large-scale +labeled ground-truth annotations for learning occlusion. 
In this work, we +introduce a novel framework for automatically generating a large, realistic +dataset of dynamic objects under occlusions using freely available time-lapse +imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) +and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects +are identified automatically and composited into the background in a clip-art +style, ensuring realistic appearances and physically accurate occlusion +configurations. The resulting clip-art image with pseudo-groundtruth enables +efficient training of object reconstruction methods that are robust to +occlusions. Our method demonstrates significant improvements in both 2D and 3D +reconstruction, particularly in scenarios with heavily occluded objects like +vehicles and people in urban scenes.",cs.CV,['cs.CV'] +Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,Yushi Hu · Otilia Stretcu · Chun-Ta Lu · Krishnamurthy Viswanathan · Kenji Hata · Enming Luo · Ranjay Krishna · Ariel Fuxman, ,https://arxiv.org/abs/2312.03052,,2312.03052.pdf,Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,"Solving complex visual tasks such as ""Who invented the musical instrument on +the right?"" involves a composition of skills: understanding space, recognizing +instruments, and also retrieving prior knowledge. Recent work shows promise by +decomposing such tasks using a large language model (LLM) into an executable +program that invokes specialized vision models. However, generated programs are +error-prone: they omit necessary steps, include spurious ones, and are unable +to recover when the specialized models give incorrect outputs. Moreover, they +require loading multiple models, incurring high latency and computation costs. +We propose Visual Program Distillation (VPD), an instruction tuning framework +that produces a vision-language model (VLM) capable of solving complex visual +tasks with a single forward pass. VPD distills the reasoning ability of LLMs by +using them to sample multiple candidate programs, which are then executed and +verified to identify a correct one. It translates each correct program into a +language description of the reasoning steps, which are then distilled into a +VLM. Extensive experiments show that VPD improves the VLM's ability to count, +understand spatial relations, and reason compositionally. Our VPD-trained +PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance +across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, +and Hateful Memes. An evaluation with human annotators also confirms that VPD +improves model response factuality and consistency. 
Finally, experiments on +content moderation demonstrate that VPD is also helpful for adaptation to +real-world applications with limited data.",cs.CV,"['cs.CV', 'cs.CL']" +Learning Multi-dimensional Human Preference for Text-to-Image Generation,Sixian Zhang · Bohan Wang · Junqiang Wu · Yan Li · Tingting Gao · Di ZHANG · Zhongyuan Wang,https://wangbohan97.github.io/MPS/,,,,,,,nan +IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation,Mengshun Hu · Kui Jiang · Zhihang Zhong · Zheng Wang · Yinqiang Zheng, ,https://arxiv.org/abs/2404.13534,,2404.13534.pdf,Motion-aware Latent Diffusion Models for Video Frame Interpolation,"With the advancement of AIGC, video frame interpolation (VFI) has become a +crucial component in existing video generation frameworks, attracting +widespread research interest. For the VFI task, the motion estimation between +neighboring frames plays a crucial role in avoiding motion ambiguity. However, +existing VFI methods always struggle to accurately predict the motion +information between consecutive frames, and this imprecise estimation leads to +blurred and visually incoherent interpolated frames. In this paper, we propose +a novel diffusion framework, motion-aware latent diffusion models (MADiff), +which is specifically designed for the VFI task. By incorporating motion priors +between the conditional neighboring frames with the target interpolated frame +predicted throughout the diffusion sampling procedure, MADiff progressively +refines the intermediate outcomes, culminating in generating both visually +smooth and realistic results. Extensive experiments conducted on benchmark +datasets demonstrate that our method achieves state-of-the-art performance +significantly outperforming existing approaches, especially under challenging +scenarios involving dynamic textures with complex motion.",cs.CV,['cs.CV'] +MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation,Hanzhe Hu · Zhizhuo Zhou · Varun Jampani · Shubham Tulsiani, ,https://arxiv.org/abs/2404.03656,,2404.03656.pdf,MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation,"We present MVD-Fusion: a method for single-view 3D inference via generative +modeling of multi-view-consistent RGB-D images. While recent methods pursuing +3D inference advocate learning novel-view generative models, these generations +are not 3D-consistent and require a distillation process to generate a 3D +output. We instead cast the task of 3D inference as directly generating +mutually-consistent multiple views and build on the insight that additionally +inferring depth can provide a mechanism for enforcing this consistency. +Specifically, we train a denoising diffusion model to generate multi-view RGB-D +images given a single RGB input image and leverage the (intermediate noisy) +depth estimates to obtain reprojection-based conditioning to maintain +multi-view consistency. We train our model using large-scale synthetic dataset +Obajverse as well as the real-world CO3D dataset comprising of generic camera +viewpoints. We demonstrate that our approach can yield more accurate synthesis +compared to recent state-of-the-art, including distillation-based 3D inference +and prior multi-view generation methods. 
We also evaluate the geometry induced +by our multi-view depth prediction and find that it yields a more accurate +representation than other direct 3D inference approaches.",cs.CV,['cs.CV'] +Video ReCap: Recursive Captioning of Hour-Long Videos,Md Mohaiminul Islam · Vu Bao Ngan Ho · Xitong Yang · Tushar Nagarajan · Lorenzo Torresani · Gedas Bertasius, ,https://arxiv.org/abs/2402.13250,,2402.13250.pdf,Video ReCap: Recursive Captioning of Hour-Long Videos,"Most video captioning models are designed to process short video clips of few +seconds and output text describing low-level visual concepts (e.g., objects, +scenes, atomic actions). However, most real-world videos last for minutes or +hours and have a complex hierarchical structure spanning different temporal +granularities. We propose Video ReCap, a recursive video captioning model that +can process video inputs of dramatically different lengths (from 1 second to 2 +hours) and output video captions at multiple hierarchy levels. The recursive +video-language architecture exploits the synergy between different video +hierarchies and can process hour-long videos efficiently. We utilize a +curriculum learning training scheme to learn the hierarchical structure of +videos, starting from clip-level captions describing atomic actions, then +focusing on segment-level descriptions, and concluding with generating +summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by +augmenting Ego4D with 8,267 manually collected long-range video summaries. Our +recursive model can flexibly generate captions at different hierarchy levels +while also being useful for other complex video understanding tasks, such as +VideoQA on EgoSchema. Data, code, and models are available at: +https://sites.google.com/view/vidrecap",cs.CV,['cs.CV'] +SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis,Teng Hu · Ran Yi · Baihong Qian · Jiangning Zhang · Paul L. Rosin · Yu-Kun Lai, ,https://arxiv.org/html/2405.02962v1,,2405.02962v1.pdf,VectorPainter: A Novel Approach to Stylized Vector Graphics Synthesis with Vectorized Strokes,"We propose a novel method, VectorPainter, for the task of stylized vector +graphics synthesis. Given a text prompt and a reference style image, +VectorPainter generates a vector graphic that aligns in content with the text +prompt and remains faithful in style to the reference image. We recognize that +the key to this task lies in fully leveraging the intrinsic properties of +vector graphics. Innovatively, we conceptualize the stylization process as the +rearrangement of vectorized strokes extracted from the reference image. +VectorPainter employs an optimization-based pipeline. It begins by extracting +vectorized strokes from the reference image, which are then used to initialize +the synthesis process. To ensure fidelity to the reference style, a novel style +preservation loss is introduced. Extensive experiments have been conducted to +demonstrate that our method is capable of aligning with the text description +while remaining faithful to the reference image.",cs.CV,['cs.CV'] +GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds,Shengjun Zhang · Xin Fei · Yueqi Duan, ,https://arxiv.org/abs/2403.19220,,2403.19220.pdf,GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds,"Point clouds captured by different sensors such as RGB-D cameras and LiDAR +possess non-negligible domain gaps. 
Most existing methods design different +network architectures and train separately on point clouds from various +sensors. Typically, point-based methods achieve outstanding performances on +even-distributed dense point clouds from RGB-D cameras, while voxel-based +methods are more efficient for large-range sparse LiDAR point clouds. In this +paper, we propose geometry-to-voxel auxiliary learning to enable voxel +representations to access point-level geometric information, which supports +better generalisation of the voxel-based backbone with additional +interpretations of multi-sensor point clouds. Specifically, we construct +hierarchical geometry pools generated by a voxel-guided dynamic point network, +which efficiently provide auxiliary fine-grained geometric information adapted +to different stages of voxel features. We conduct experiments on joint +multi-sensor datasets to demonstrate the effectiveness of GeoAuxNet. Enjoying +elaborate geometric information, our method outperforms other models +collectively trained on multi-sensor datasets, and achieve competitive results +with the-state-of-art experts on each single dataset.",cs.CV,['cs.CV'] +BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics,Wenqian Zhang · Molin Huang · Yuxuan Zhou · Juze Zhang · Jingyi Yu · Jingya Wang · Lan Xu,https://github.com/Godheritage/BOTH2Hands,https://arxiv.org/abs/2312.07937,,2312.07937.pdf,BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics,"The recently emerging text-to-motion advances have spired numerous attempts +for convenient and interactive human motion generation. Yet, existing methods +are largely limited to generating body motions only without considering the +rich two-hand motions, let alone handling various conditions like body dynamics +or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal +dataset for two-hand motion generation. Our dataset includes accurate motion +tracking for the human body and hands and provides pair-wised finger-level hand +annotations and body descriptions. We further provide a strong baseline method, +BOTH2Hands, for the novel task: generating vivid two-hand motions from both +implicit body dynamics and explicit text prompts. We first warm up two parallel +body-to-hand and text-to-hand diffusion models and then utilize the +cross-attention transformer for motion blending. Extensive experiments and +cross-validations demonstrate the effectiveness of our approach and dataset for +generating convincing two-hand motions from the hybrid body-and-textual +conditions. Our dataset and code will be disseminated to the community for +future research.",cs.CV,['cs.CV'] +Paint3D: Paint Anything 3D with Lighting-less Texture Diffusion Models,Xianfang Zeng · Xin Chen · Zhongqi Qi · Wen Liu · Zibo Zhao · Zhibin Wang · Bin Fu · Yong Liu · Gang Yu, ,https://arxiv.org/abs/2312.13913,,2312.13913.pdf,Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models,"This paper presents Paint3D, a novel coarse-to-fine generative framework that +is capable of producing high-resolution, lighting-less, and diverse 2K UV +texture maps for untextured 3D meshes conditioned on text or image inputs. The +key challenge addressed is generating high-quality textures without embedded +illumination information, which allows the textures to be re-lighted or +re-edited within modern graphics pipelines. 
To achieve this, our method first +leverages a pre-trained depth-aware 2D diffusion model to generate +view-conditional images and perform multi-view texture fusion, producing an +initial coarse texture map. However, as 2D models cannot fully represent 3D +shapes and disable lighting effects, the coarse texture map exhibits incomplete +areas and illumination artifacts. To resolve this, we train separate UV +Inpainting and UVHD diffusion models specialized for the shape-aware refinement +of incomplete areas and the removal of illumination artifacts. Through this +coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that +maintain semantic consistency while being lighting-less, significantly +advancing the state-of-the-art in texturing 3D objects.",cs.CV,['cs.CV'] +Overload: Latency Attacks on Object Detection for Edge Devices,Erh-Chung Chen · Pin-Yu Chen · I-Hsin Chung · Che-Rung Lee, ,https://ar5iv.labs.arxiv.org/html/2304.05370,,2304.05370.pdf,Overload: Latency Attacks on Object Detection for Edge Devices,"Nowadays, the deployment of deep learning-based applications is an essential +task owing to the increasing demands on intelligent services. In this paper, we +investigate latency attacks on deep learning applications. Unlike common +adversarial attacks for misclassification, the goal of latency attacks is to +increase the inference time, which may stop applications from responding to the +requests within a reasonable time. This kind of attack is ubiquitous for +various applications, and we use object detection to demonstrate how such kind +of attacks work. We also design a framework named Overload to generate latency +attacks at scale. Our method is based on a newly formulated optimization +problem and a novel technique, called spatial attention. This attack serves to +escalate the required computing costs during the inference time, consequently +leading to an extended inference time for object detection. It presents a +significant threat, especially to systems with limited computing resources. We +conducted experiments using YOLOv5 models on Nvidia NX. Compared to existing +methods, our method is simpler and more effective. The experimental results +show that with latency attacks, the inference time of a single image can be +increased ten times longer in reference to the normal setting. Moreover, our +findings pose a potential new threat to all object detection tasks requiring +non-maximum suppression (NMS), as our attack is NMS-agnostic.",cs.CV,['cs.CV'] +OmniGlue: Generalizable Feature Matching with Foundation Model Guidance,Hanwen Jiang · Arjun Karpur · Bingyi Cao · Qixing Huang · André Araujo, ,https://arxiv.org/abs/2405.12979,,2405.12979.pdf,OmniGlue: Generalizable Feature Matching with Foundation Model Guidance,"The image matching field has been witnessing a continuous emergence of novel +learnable feature matching techniques, with ever-improving performance on +conventional benchmarks. However, our investigation shows that despite these +gains, their potential for real-world applications is restricted by their +limited generalization capabilities to novel image domains. In this paper, we +introduce OmniGlue, the first learnable image matcher that is designed with +generalization as a core principle. OmniGlue leverages broad knowledge from a +vision foundation model to guide the feature matching process, boosting +generalization to domains not seen at training time. 
Additionally, we propose a +novel keypoint position-guided attention mechanism which disentangles spatial +and appearance information, leading to enhanced matching descriptors. We +perform comprehensive experiments on a suite of $7$ datasets with varied image +domains, including scene-level, object-centric and aerial images. OmniGlue's +novel components lead to relative gains on unseen domains of $20.9\%$ with +respect to a directly comparable reference model, while also outperforming the +recent LightGlue method by $9.5\%$ relatively.Code and model can be found at +https://hwjiang1510.github.io/OmniGlue",cs.CV,['cs.CV'] +InstaGen: Enhancing Object Detection by Training on Synthetic Dataset,Chengjian Feng · Yujie Zhong · Zequn Jie · Weidi Xie · Lin Ma, ,https://arxiv.org/abs/2402.05937,,2402.05937.pdf,InstaGen: Enhancing Object Detection by Training on Synthetic Dataset,"In this paper, we present a novel paradigm to enhance the ability of object +detector, e.g., expanding categories or improving detection performance, by +training on synthetic dataset generated from diffusion models. Specifically, we +integrate an instance-level grounding head into a pre-trained, generative +diffusion model, to augment it with the ability of localising instances in the +generated images. The grounding head is trained to align the text embedding of +category names with the regional visual feature of the diffusion model, using +supervision from an off-the-shelf object detector, and a novel self-training +scheme on (novel) categories not covered by the detector. We conduct thorough +experiments to show that, this enhanced version of diffusion model, termed as +InstaGen, can serve as a data synthesizer, to enhance object detectors by +training on its generated samples, demonstrating superior performance over +existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse +(+1.2 to 5.2 AP) scenarios. Project page with code: +https://fcjian.github.io/InstaGen.",cs.CV,['cs.CV'] +LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering,Jaehoon Choi · Rajvi Shah · Qinbo Li · Yipeng Wang · Ayush Saraf · Changil Kim · Jia-Bin Huang · Dinesh Manocha · Suhib Alsisan · Johannes Kopf,https://jh-choi.github.io/LTMM,https://arxiv.org/html/2404.15891v2,,2404.15891v2.pdf,OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation,"Recent advancements in 3D reconstruction technologies have paved the way for +high-quality and real-time rendering of complex 3D scenes. Despite these +achievements, a notable challenge persists: it is difficult to precisely +reconstruct specific objects from large scenes. Current scene reconstruction +techniques frequently result in the loss of object detail textures and are +unable to reconstruct object portions that are occluded or unseen in views. To +address this challenge, we delve into the meticulous 3D reconstruction of +specific objects within large scenes and propose a framework termed OMEGAS: +Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. +OMEGAS employs a multi-step approach, grounded in several excellent +off-the-shelf methodologies. Specifically, initially, we utilize the Segment +Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS), +thereby creating a basic 3DGS model of the target object. 
Then, we leverage +large-scale diffusion priors to further refine the details of the 3DGS model, +especially aimed at addressing invisible or occluded object portions from the +original scene views. Subsequently, by re-rendering the 3DGS model onto the +scene views, we achieve accurate object segmentation and effectively remove the +background. Finally, these target-only images are used to improve the 3DGS +model further and extract the definitive 3D object mesh by the SuGaR model. In +various scenarios, our experiments demonstrate that OMEGAS significantly +surpasses existing scene reconstruction methods. Our project page is at: +https://github.com/CrystalWlz/OMEGAS",cs.CV,['cs.CV'] +Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models,Pablo Marcos-Manchón · Roberto Alcover-Couso · Juan SanMiguel · Jose M. Martinez,https://github.com/vpulab/ovam,https://arxiv.org/abs/2403.14291v1,,2403.14291v1.pdf,Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models,"Diffusion models represent a new paradigm in text-to-image generation. Beyond +generating high-quality images from text prompts, models such as Stable +Diffusion have been successfully extended to the joint generation of semantic +segmentation pseudo-masks. However, current extensions primarily rely on +extracting attentions linked to prompt words used for image synthesis. This +approach limits the generation of segmentation masks derived from word tokens +not contained in the text prompt. In this work, we introduce Open-Vocabulary +Attention Maps (OVAM)-a training-free method for text-to-image diffusion models +that enables the generation of attention maps for any word. In addition, we +propose a lightweight optimization process based on OVAM for finding tokens +that generate accurate attention maps for an object class with a single +annotation. We evaluate these tokens within existing state-of-the-art Stable +Diffusion extensions. The best-performing model improves its mIoU from 52.1 to +86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized +tokens are an efficient way to improve the performance of existing methods +without architectural changes or retraining.",cs.CV,['cs.CV'] +Leveraging Frame Affinity for sRGB-to-RAW Video De-rendering,Chen Zhang · Wencheng Han · Yang Zhou · Jianbing Shen · Cheng-Zhong Xu · Wentao Liu, ,https://arxiv.org/abs/2404.09490,,2404.09490.pdf,Leveraging Temporal Contextualization for Video Action Recognition,"Pretrained vision-language models have shown effectiveness in video +understanding. However, recent studies have not sufficiently leveraged +essential temporal information from videos, simply averaging frame-wise +representations or referencing consecutive frames. We introduce Temporally +Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding +that effectively and efficiently leverages comprehensive video information. We +propose Temporal Contextualization (TC), a novel layer-wise temporal +information infusion mechanism for video that extracts core information from +each frame, interconnects relevant information across the video to summarize +into context tokens, and ultimately leverages the context tokens during the +feature encoding process. Furthermore, our Video-conditional Prompting (VP) +module manufactures context tokens to generate informative prompts in text +modality. 
We conduct extensive experiments in zero-shot, few-shot, +base-to-novel, and fully-supervised action recognition to validate the +superiority of our TC-CLIP. Ablation studies for TC and VP guarantee our design +choices. Code is available at https://github.com/naver-ai/tc-clip",cs.CV,['cs.CV'] +UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion,Junsheng Zhou · Weiqi Zhang · Baorui Ma · Kanle Shi · Yu-Shen Liu · Zhizhong Han, ,https://arxiv.org/abs/2404.06851,,2404.06851.pdf,UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion,"Diffusion models have shown remarkable results for image generation, editing +and inpainting. Recent works explore diffusion models for 3D shape generation +with neural implicit functions, i.e., signed distance function and occupancy +function. However, they are limited to shapes with closed surfaces, which +prevents them from generating diverse 3D real-world contents containing open +surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned +distance fields (UDFs) which is capable to generate textured 3D shapes with +open surfaces from text conditions or unconditionally. Our key idea is to +generate UDFs in spatial-frequency domain with an optimal wavelet +transformation, which produces a compact representation space for UDF +generation. Specifically, instead of selecting an appropriate wavelet +transformation which requires expensive manual efforts and still leads to large +information loss, we propose a data-driven approach to learn the optimal +wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by +numerical and visual comparisons with the latest methods on widely used +benchmarks. Page: https://weiqi-zhang.github.io/UDiFF.",cs.CV,['cs.CV'] +OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation,Xiongwei Wu · Sicheng Yu · Ee-Peng Lim · Chong Wah Ngo, ,https://arxiv.org/abs/2404.01409,,2404.01409.pdf,OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation,"In the realm of food computing, segmenting ingredients from images poses +substantial challenges due to the large intra-class variance among the same +ingredients, the emergence of new ingredients, and the high annotation costs +associated with large food segmentation datasets. Existing approaches primarily +utilize a closed-vocabulary and static text embeddings setting. These methods +often fall short in effectively handling the ingredients, particularly new and +diverse ones. In response to these limitations, we introduce OVFoodSeg, a +framework that adopts an open-vocabulary setting and enhances text embeddings +with visual context. By integrating vision-language models (VLMs), our approach +enriches text embedding with image-specific information through two innovative +modules, eg, an image-to-text learner FoodLearner and an Image-Informed Text +Encoder. The training process of OVFoodSeg is divided into two stages: the +pre-training of FoodLearner and the subsequent learning phase for segmentation. +The pre-training phase equips FoodLearner with the capability to align visual +information with corresponding textual representations that are specifically +related to food, while the second phase adapts both the FoodLearner and the +Image-Informed Text Encoder for the segmentation task. 
By addressing the +deficiencies of previous models, OVFoodSeg demonstrates a significant +improvement, achieving an 4.9\% increase in mean Intersection over Union (mIoU) +on the FoodSeg103 dataset, setting a new milestone for food image segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +LaneCPP: Continuous 3D Lane Detection using Physical Priors,Maximilian Pittner · Joel Janai · Alexandru Paul Condurache, ,https://arxiv.org/abs/2401.08036,,2401.08036.pdf,3D Lane Detection from Front or Surround-View using Joint-Modeling & Matching,"3D lanes offer a more comprehensive understanding of the road surface +geometry than 2D lanes, thereby providing crucial references for driving +decisions and trajectory planning. While many efforts aim to improve prediction +accuracy, we recognize that an efficient network can bring results closer to +lane modeling. However, if the modeling data is imprecise, the results might +not accurately capture the real-world scenario. Therefore, accurate lane +modeling is essential to align prediction results closely with the environment. +This study centers on efficient and accurate lane modeling, proposing a joint +modeling approach that combines Bezier curves and interpolation methods. +Furthermore, based on this lane modeling approach, we developed a Global2Local +Lane Matching method with Bezier Control-Point and Key-Point, which serve as a +comprehensive solution that leverages hierarchical features with two +mathematical models to ensure a precise match. We also introduce a novel 3D +Spatial Encoder, representing an exploration of 3D surround-view lane detection +research. The framework is suitable for front-view or surround-view 3D lane +detection. By directly outputting the key points of lanes in 3D space, it +overcomes the limitations of anchor-based methods, enabling accurate prediction +of closed-loop or U-shaped lanes and effective adaptation to complex road +conditions. This innovative method establishes a new benchmark in front-view 3D +lane detection on the Openlane dataset and achieves competitive performance in +surround-view 2D lane detection on the Argoverse2 dataset.",cs.CV,['cs.CV'] +MonoNPHM: Dynamic Head Reconstruction from Monocular Videos,Simon Giebenhain · Tobias Kirschstein · Markos Georgopoulos · Martin Rünz · Lourdes Agapito · Matthias Nießner,https://simongiebenhain.github.io/MonoNPHM/,https://arxiv.org/abs/2312.06740,,2312.06740.pdf,MonoNPHM: Dynamic Head Reconstruction from Monocular Videos,"We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D +head reconstructions from monocular RGB videos. To this end, we propose a +latent appearance space that parameterizes a texture field on top of a neural +parametric model. We constrain predicted color values to be correlated with the +underlying geometry such that gradients from RGB effectively influence latent +geometry codes during inverse rendering. To increase the representational +capacity of our expression space, we augment our backward deformation field +with hyper-dimensions, thus improving color and geometry representation in +topologically challenging expressions. Using MonoNPHM as a learned prior, we +approach the task of 3D head reconstruction using signed distance field based +volumetric rendering. By numerically inverting our backward deformation field, +we incorporated a landmark loss using facial anchor points that are closely +tied to our canonical geometry representation. 
To evaluate the task of dynamic +face reconstruction from monocular RGB videos we record 20 challenging Kinect +sequences under casual conditions. MonoNPHM outperforms all baselines with a +significant margin, and makes an important step towards easily accessible +neural parametric face models through RGB tracking.",cs.CV,['cs.CV'] +Retrieval-Augmented Egocentric Video Captioning,Jilan Xu · Yifei Huang · Junlin Hou · Guo Chen · Yuejie Zhang · Rui Feng · Weidi Xie, ,https://arxiv.org/abs/2401.00789,,2401.00789.pdf,Retrieval-Augmented Egocentric Video Captioning,"Understanding human actions from videos of first-person view poses +significant challenges. Most prior approaches explore representation learning +on egocentric videos only, while overlooking the potential benefit of +exploiting existing large-scale third-person videos. In this paper, (1) we +develop EgoInstructor, a retrieval-augmented multimodal captioning model that +automatically retrieves semantically relevant third-person instructional videos +to enhance the video captioning of egocentric videos. (2) For training the +cross-view retrieval module, we devise an automatic pipeline to discover +ego-exo video pairs from distinct large-scale egocentric and exocentric +datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE +loss that pulls egocentric and exocentric video features closer by aligning +them to shared text features that describe similar actions. (4) Through +extensive experiments, our cross-view retrieval module demonstrates superior +performance across seven benchmarks. Regarding egocentric video captioning, +EgoInstructor exhibits significant improvements by leveraging third-person +videos as references.",cs.CV,['cs.CV'] +Relaxed Contrastive Learning for Federated Learning,Seonguk Seo · Jinkyu Kim · Geeho Kim · Bohyung Han, ,https://arxiv.org/abs/2401.04928,,2401.04928.pdf,Relaxed Contrastive Learning for Federated Learning,"We propose a novel contrastive learning framework to effectively address the +challenges of data heterogeneity in federated learning. We first analyze the +inconsistency of gradient updates across clients during local training and +establish its dependence on the distribution of feature representations, +leading to the derivation of the supervised contrastive learning (SCL) +objective to mitigate local deviations. In addition, we show that a na\""ive +adoption of SCL in federated learning leads to representation collapse, +resulting in slow convergence and limited performance gains. To address this +issue, we introduce a relaxed contrastive learning loss that imposes a +divergence penalty on excessively similar sample pairs within each class. This +strategy prevents collapsed representations and enhances feature +transferability, facilitating collaborative training and leading to significant +performance improvements. Our framework outperforms all existing federated +learning approaches by huge margins on the standard benchmarks through +extensive experimental results.",cs.LG,['cs.LG'] +Rewrite the stars,Xu Ma · Xiyang Dai · Yue Bai · Yizhou Wang · Yun Fu, ,https://arxiv.org/abs/2403.19967,,2403.19967.pdf,Rewrite the Stars,"Recent studies have drawn attention to the untapped potential of the ""star +operation"" (element-wise multiplication) in network design. While intuitive +explanations abound, the foundational rationale behind its application remains +largely unexplored. 
Our study attempts to reveal the star operation's ability +to map inputs into high-dimensional, non-linear feature spaces -- akin to +kernel tricks -- without widening the network. We further introduce StarNet, a +simple yet powerful prototype, demonstrating impressive performance and low +latency under compact network structure and efficient budget. Like stars in the +sky, the star operation appears unremarkable but holds a vast universe of +potential. Our work encourages further exploration across tasks, with codes +available at https://github.com/ma-xu/Rewrite-the-Stars.",cs.CV,['cs.CV'] +PointInfinity: Resolution-Invariant Point Diffusion Models,Zixuan Huang · Justin Johnson · Shoubhik Debnath · James Rehg · Chao-Yuan Wu,https://zixuanh.com/projects/pointinfinity,https://arxiv.org/abs/2404.03566v1,,2404.03566v1.pdf,PointInfinity: Resolution-Invariant Point Diffusion Models,"We present PointInfinity, an efficient family of point cloud diffusion +models. Our core idea is to use a transformer-based architecture with a +fixed-size, resolution-invariant latent representation. This enables efficient +training with low-resolution point clouds, while allowing high-resolution point +clouds to be generated during inference. More importantly, we show that scaling +the test-time resolution beyond the training resolution improves the fidelity +of generated point clouds and surfaces. We analyze this phenomenon and draw a +link to classifier-free guidance commonly used in diffusion models, +demonstrating that both allow trading off fidelity and variability during +inference. Experiments on CO3D show that PointInfinity can efficiently generate +high-resolution point clouds (up to 131k points, 31 times more than Point-E) +with state-of-the-art quality.",cs.CV,['cs.CV'] +JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation,Yu Zeng · Vishal M. Patel · Haochen Wang · Xun Huang · Ting-Chun Wang · Ming-Yu Liu · Yogesh Balaji,https://research.nvidia.com/labs/dir/jedi/,https://arxiv.org/html/2307.04725v2,,2307.04725v2.pdf,AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning,"With the advance of text-to-image (T2I) diffusion models (e.g., Stable +Diffusion) and corresponding personalization techniques such as DreamBooth and +LoRA, everyone can manifest their imagination into high-quality images at an +affordable cost. However, adding motion dynamics to existing high-quality +personalized T2Is and enabling them to generate animations remains an open +challenge. In this paper, we present AnimateDiff, a practical framework for +animating personalized T2I models without requiring model-specific tuning. At +the core of our framework is a plug-and-play motion module that can be trained +once and seamlessly integrated into any personalized T2Is originating from the +same base T2I. Through our proposed training strategy, the motion module +effectively learns transferable motion priors from real-world videos. Once +trained, the motion module can be inserted into a personalized T2I model to +form a personalized animation generator. We further propose MotionLoRA, a +lightweight fine-tuning technique for AnimateDiff that enables a pre-trained +motion module to adapt to new motion patterns, such as different shot types, at +a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA +on several public representative personalized T2I models collected from the +community. 
The results demonstrate that our approaches help these models +generate temporally smooth animation clips while preserving the visual quality +and motion diversity. Codes and pre-trained weights are available at +https://github.com/guoyww/AnimateDiff.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models,Shweta Mahajan · Tanzila Rahman · Kwang Moo Yi · Leonid Sigal, ,https://arxiv.org/abs/2312.12416,,2312.12416.pdf,Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models,"The quality of the prompts provided to text-to-image diffusion models +determines how faithful the generated content is to the user's intent, often +requiring `prompt engineering'. To harness visual concepts from target images +without prompt engineering, current approaches largely rely on embedding +inversion by optimizing and then mapping them to pseudo-tokens. However, +working with such high-dimensional vector representations is challenging +because they lack semantics and interpretability, and only allow simple vector +operations when using them. Instead, this work focuses on inverting the +diffusion model to obtain interpretable language prompts directly. The +challenge of doing this lies in the fact that the resulting optimization +problem is fundamentally discrete and the space of prompts is exponentially +large; this makes using standard optimization techniques, such as stochastic +gradient descent, difficult. To this end, we utilize a delayed projection +scheme to optimize for prompts representative of the vocabulary space in the +model. Further, we leverage the findings that different timesteps of the +diffusion process cater to different levels of detail in an image. The later, +noisy, timesteps of the forward diffusion process correspond to the semantic +information, and therefore, prompt inversion in this range provides tokens +representative of the image semantics. We show that our approach can identify +semantically interpretable and meaningful prompts for a target image which can +be used to synthesize diverse images with similar content. We further +illustrate the application of the optimized prompts in evolutionary image +generation and concept removal.",cs.CV,"['cs.CV', 'cs.LG']" +Pixel Aligned Language Models,Jiarui Xu · Xingyi Zhou · Shen Yan · Xiuye Gu · Anurag Arnab · Chen Sun · Xiaolong Wang · Cordelia Schmid,https://jerryxu.net/PixelLLM/,https://arxiv.org/abs/2312.09237,,2312.09237.pdf,Pixel Aligned Language Models,"Large language models have achieved great success in recent years, so as +their variants in vision. Existing vision-language models can describe images +in natural languages, answer visual-related questions, or perform complex +reasoning about the image. However, it is yet unclear how localization tasks, +such as word grounding or referring localization, can be performed using large +language models. In this work, we aim to develop a vision-language model that +can take locations, for example, a set of points or boxes, as either inputs or +outputs. When taking locations as inputs, the model performs +location-conditioned captioning, which generates captions for the indicated +object or region. When generating locations as outputs, our model regresses +pixel coordinates for each output word generated by the language model, and +thus performs dense word grounding. 
Our model is pre-trained on the Localized +Narrative dataset, which contains pixel-word-aligned captioning from human +attention. We show our model can be applied to various location-aware +vision-language tasks, including referring localization, location-conditioned +captioning, and dense object captioning, archiving state-of-the-art performance +on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM .",cs.CV,['cs.CV'] +Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning,Xinshun Wang · Zhongbin Fang · Xia Li · Xiangtai Li · Chen Chen · Mengyuan Liu, ,https://arxiv.org/abs/2312.03703,,2312.03703.pdf,Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning,"In-context learning provides a new perspective for multi-task modeling for +vision and NLP. Under this setting, the model can perceive tasks from prompts +and accomplish them without any extra task-specific head predictions or model +fine-tuning. However, Skeleton sequence modeling via in-context learning +remains unexplored. Directly applying existing in-context models from other +areas onto skeleton sequences fails due to the inter-frame and cross-task pose +similarity that makes it outstandingly hard to perceive the task correctly from +a subtle context. To address this challenge, we propose Skeleton-in-Context +(SiC), an effective framework for in-context skeleton sequence modeling. Our +SiC is able to handle multiple skeleton-based tasks simultaneously after a +single training process and accomplish each task from context according to the +given prompt. It can further generalize to new, unseen tasks according to +customized prompts. To facilitate context perception, we additionally propose a +task-unified prompt, which adaptively learns tasks of different natures, such +as partial joint-level generation, sequence-level prediction, or 2D-to-3D +motion prediction. We conduct extensive experiments to evaluate the +effectiveness of our SiC on multiple tasks, including motion prediction, pose +estimation, joint completion, and future pose estimation. We also evaluate its +generalization capability on unseen tasks such as motion-in-between. These +experiments show that our model achieves state-of-the-art multi-task +performance and even outperforms single-task methods on certain tasks.",cs.CV,['cs.CV'] +CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment,Hyeongmin Lee · Kyoungkook Kang · Jungseul Ok · Sunghyun Cho, ,https://arxiv.org/abs/2404.01123,,2404.01123.pdf,CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment,"Recent image tone adjustment (or enhancement) approaches have predominantly +adopted supervised learning for learning human-centric perceptual assessment. +However, these approaches are constrained by intrinsic challenges of supervised +learning. Primarily, the requirement for expertly-curated or retouched images +escalates the data acquisition expenses. Moreover, their coverage of target +style is confined to stylistic variants inferred from the training data. To +surmount the above challenges, we propose an unsupervised learning-based +approach for text-based image tone adjustment method, CLIPtone, that extends an +existing image enhancement method to accommodate natural language descriptions. +Specifically, we design a hyper-network to adaptively modulate the pretrained +parameters of the backbone model based on text description. 
To assess whether +the adjusted image aligns with the text description without ground truth image, +we utilize CLIP, which is trained on a vast set of language-image pairs and +thus encompasses knowledge of human perception. The major advantages of our +approach are three fold: (i) minimal data collection expenses, (ii) support for +a range of adjustments, and (iii) the ability to handle novel text descriptions +unseen in training. Our approach's efficacy is demonstrated through +comprehensive experiments, including a user study.",cs.CV,"['cs.CV', 'cs.GR', 'eess.IV']" +PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor,Jaewon Jung · Hongsun Jang · Jaeyong Song · Jinho Lee,https://github.com/jaewonalive/PeerAiD,https://arxiv.org/abs/2403.06668,,2403.06668.pdf,PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor,"Adversarial robustness of the neural network is a significant concern when it +is applied to security-critical domains. In this situation, adversarial +distillation is a promising option which aims to distill the robustness of the +teacher network to improve the robustness of a small student network. Previous +works pretrain the teacher network to make it robust against the adversarial +examples aimed at itself. However, the adversarial examples are dependent on +the parameters of the target network. The fixed teacher network inevitably +degrades its robustness against the unseen transferred adversarial examples +which target the parameters of the student network in the adversarial +distillation process. We propose PeerAiD to make a peer network learn the +adversarial examples of the student network instead of adversarial examples +aimed at itself. PeerAiD is an adversarial distillation that trains the peer +network and the student network simultaneously in order to specialize the peer +network for defending the student network. We observe that such peer networks +surpass the robustness of the pretrained robust teacher model against +adversarial examples aimed at the student network. With this peer network and +adversarial distillation, PeerAiD achieves significantly higher robustness of +the student network with AutoAttack (AA) accuracy by up to 1.66%p and improves +the natural accuracy of the student network by up to 4.72%p with ResNet-18 on +TinyImageNet dataset. Code is available at +https://github.com/jaewonalive/PeerAiD.",cs.LG,"['cs.LG', 'cs.CV']" +MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,Jianjian Cao · Peng Ye · Shengze Li · Chong Yu · Yansong Tang · Jiwen Lu · Tao Chen,https://github.com/double125/MADTP,https://arxiv.org/abs/2403.02991,,2403.02991.pdf,MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,"Vision-Language Transformers (VLTs) have shown great success recently, but +are meanwhile accompanied by heavy computation costs, where a major reason can +be attributed to the large number of visual and language tokens. Existing token +pruning research for compressing VLTs mainly follows a single-modality-based +scheme yet ignores the critical role of aligning different modalities for +guiding the token pruning process, causing the important tokens for one +modality to be falsely pruned in another modality branch. Meanwhile, existing +VLT pruning works also lack the flexibility to dynamically compress each layer +based on different input samples. 
To this end, we propose a novel framework +named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for +accelerating various VLTs. Specifically, we first introduce a well-designed +Multi-modality Alignment Guidance (MAG) module that can align features of the +same semantic concept from different modalities, to ensure the pruned tokens +are less important for all modalities. We further design a novel Dynamic Token +Pruning (DTP) module, which can adaptively adjust the token compression ratio +in each layer based on different input instances. Extensive experiments on +various benchmarks demonstrate that MADTP significantly reduces the +computational complexity of kinds of multimodal models while preserving +competitive performance. Notably, when applied to the BLIP model in the NLVR2 +dataset, MADTP can reduce the GFLOPs by 80% with less than 4% performance +degradation.",cs.CV,['cs.CV'] +VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models,Hyeonho Jeong · Geon Yeong Park · Jong Chul Ye,https://video-motion-customization.github.io/,https://arxiv.org/abs/2312.00845,,2312.00845.pdf,VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models,"Text-to-video diffusion models have advanced video generation significantly. +However, customizing these models to generate videos with tailored motions +presents a substantial challenge. In specific, they encounter hurdles in (a) +accurately reproducing motion from a target video, and (b) creating diverse +visual variations. For example, straightforward extensions of static image +customization methods to video often lead to intricate entanglements of +appearance and motion data. To tackle this, here we present the Video Motion +Customization (VMC) framework, a novel one-shot tuning approach crafted to +adapt temporal attention layers within video diffusion models. Our approach +introduces a novel motion distillation objective using residual vectors between +consecutive frames as a motion reference. The diffusion process then preserves +low-frequency motion trajectories while mitigating high-frequency +motion-unrelated noise in image space. We validate our method against +state-of-the-art video generative models across diverse real-world motions and +contexts. Our codes, data and the project demo can be found at +https://video-motion-customization.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting,Zijie Chen · Lichao Zhang · Fangsheng Weng · Lili Pan · ZHENZHONG Lan,https://github.com/zzjchen/Tailored-Visions,https://arxiv.org/abs/2310.08129,,2310.08129.pdf,Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting,"Despite significant progress in the field, it is still challenging to create +personalized visual representations that align closely with the desires and +preferences of individual users. This process requires users to articulate +their ideas in words that are both comprehensible to the models and accurately +capture their vision, posing difficulties for many users. In this paper, we +tackle this challenge by leveraging historical user interactions with the +system to enhance user prompts. We propose a novel approach that involves +rewriting user prompts based on a newly collected large-scale text-to-image +dataset with over 300k prompts from 3115 users. 
Our rewriting model enhances +the expressiveness and alignment of user prompts with their intended visual +outputs. Experimental results demonstrate the superiority of our methods over +baseline approaches, as evidenced in our new offline evaluation method and +online tests. Our code and dataset are available at +https://github.com/zzjchen/Tailored-Visions.",cs.CV,['cs.CV'] +VideoBooth: Diffusion-based Video Generation with Image Prompts,Yuming Jiang · Tianxing Wu · Shuai Yang · Chenyang Si · Dahua Lin · Yu Qiao · Chen Change Loy · Ziwei Liu, ,https://arxiv.org/abs/2312.00777,,2312.00777.pdf,VideoBooth: Diffusion-based Video Generation with Image Prompts,"Text-driven video generation witnesses rapid progress. However, merely using +text prompts is not enough to depict the desired subject appearance that +accurately aligns with users' intents, especially for customized content +creation. In this paper, we study the task of video generation with image +prompts, which provide more accurate and direct content control beyond the text +prompts. Specifically, we propose a feed-forward framework VideoBooth, with two +dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine +manner. Coarse visual embeddings from image encoder provide high-level +encodings of image prompts, while fine visual embeddings from the proposed +attention injection module provide multi-scale and detailed encoding of image +prompts. These two complementary embeddings can faithfully capture the desired +appearance. 2) In the attention injection module at fine level, multi-scale +image prompts are fed into different cross-frame attention layers as additional +keys and values. This extra spatial information refines the details in the +first frame and then it is propagated to the remaining frames, which maintains +temporal consistency. Extensive experiments demonstrate that VideoBooth +achieves state-of-the-art performance in generating customized high-quality +videos with subjects specified in image prompts. Notably, VideoBooth is a +generalizable framework where a single model works for a wide range of image +prompts with feed-forward pass.",cs.CV,['cs.CV'] +FreeU: Free Lunch in Diffusion U-Net,Chenyang Si · Ziqi Huang · Yuming Jiang · Ziwei Liu,https://chenyangsi.top/FreeU/,https://arxiv.org/abs/2309.11497,,2309.11497.pdf,FreeU: Free Lunch in Diffusion U-Net,"In this paper, we uncover the untapped potential of diffusion U-Net, which +serves as a ""free lunch"" that substantially improves the generation quality on +the fly. We initially investigate the key contributions of the U-Net +architecture to the denoising process and identify that its main backbone +primarily contributes to denoising, whereas its skip connections mainly +introduce high-frequency features into the decoder module, causing the network +to overlook the backbone semantics. Capitalizing on this discovery, we propose +a simple yet effective method-termed ""FreeU"" - that enhances generation quality +without additional training or finetuning. Our key insight is to strategically +re-weight the contributions sourced from the U-Net's skip connections and +backbone feature maps, to leverage the strengths of both components of the +U-Net architecture. Promising results on image and video generation tasks +demonstrate that our FreeU can be readily integrated to existing diffusion +models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, +to improve the generation quality with only a few lines of code. 
All you need +is to adjust two scaling factors during inference. Project page: +https://chenyangsi.top/FreeU/.",cs.CV,['cs.CV'] +One-Shot Structure-Aware Stylized Image Synthesis,Hansam Cho · Jonghyun Lee · Seunggyu Chang · Yonghyun Jeong,https://github.com/hansam95/OSASIS,https://arxiv.org/abs/2402.17275,,2402.17275.pdf,One-Shot Structure-Aware Stylized Image Synthesis,"While GAN-based models have been successful in image stylization tasks, they +often struggle with structure preservation while stylizing a wide range of +input images. Recently, diffusion models have been adopted for image +stylization but still lack the capability to maintain the original quality of +input images. Building on this, we propose OSASIS: a novel one-shot stylization +method that is robust in structure preservation. We show that OSASIS is able to +effectively disentangle the semantics from the structure of an image, allowing +it to control the level of content and style implemented to a given input. We +apply OSASIS to various experimental settings, including stylization with +out-of-domain reference images and stylization with text-driven manipulation. +Results show that OSASIS outperforms other stylization methods, especially for +input images that were rarely encountered during training, providing a +promising solution to stylization via diffusion models.",cs.CV,['cs.CV'] +OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,Noor Ahmed · Anna Kukleva · Bernt Schiele, ,https://arxiv.org/abs/2403.18550,,2403.18550.pdf,OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,"Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which +the problem space expands with limited data. FSCIL methods inherently face the +challenge of catastrophic forgetting as data arrives incrementally, making +models susceptible to overwriting previously acquired knowledge. Moreover, +given the scarcity of labeled samples available at any given time, models may +be prone to overfitting and find it challenging to strike a balance between +extensive pretraining and the limited incremental data. To address these +challenges, we propose the OrCo framework built on two core principles: +features' orthogonality in the representation space, and contrastive learning. +In particular, we improve the generalization of the embedding space by +employing a combination of supervised and self-supervised contrastive losses +during the pretraining phase. Additionally, we introduce OrCo loss to address +challenges arising from data limitations during incremental sessions. Through +feature space perturbations and orthogonality between classes, the OrCo loss +maximizes margins and reserves space for the following incremental data. This, +in turn, ensures the accommodation of incoming classes in the feature space +without compromising previously acquired knowledge. Our experimental results +showcase state-of-the-art performance across three benchmark datasets, +including mini-ImageNet, CIFAR100, and CUB datasets. Code is available at +https://github.com/noorahmedds/OrCo",cs.CV,['cs.CV'] +ZeroShape: Regression-based Zero-shot Shape Reconstruction,Zixuan Huang · Stefan Stojanov · Anh Thai · Varun Jampani · James Rehg, ,https://arxiv.org/abs/2312.14198,,2312.14198.pdf,ZeroShape: Regression-based Zero-shot Shape Reconstruction,"We study the problem of single-image zero-shot 3D shape reconstruction. 
+Recent works learn zero-shot shape reconstruction through generative modeling +of 3D assets, but these models are computationally expensive at train and +inference time. In contrast, the traditional approach to this problem is +regression-based, where deterministic models are trained to directly regress +the object shape. Such regression methods possess much higher computational +efficiency than generative methods. This raises a natural question: is +generative modeling necessary for high performance, or conversely, are +regression-based approaches still competitive? To answer this, we design a +strong regression-based model, called ZeroShape, based on the converging +findings in this field and a novel insight. We also curate a large real-world +evaluation benchmark, with objects from three different real-world 3D datasets. +This evaluation benchmark is more diverse and an order of magnitude larger than +what prior works use to quantitatively evaluate their models, aiming at +reducing the evaluation variance in our field. We show that ZeroShape not only +achieves superior performance over state-of-the-art methods, but also +demonstrates significantly higher computational and data efficiency.",cs.CV,['cs.CV'] +Robust Self-calibration of Focal Lengths from the Fundamental Matrix,Viktor Kocur · Daniel Kyselica · Zuzana Kukelova,https://github.com/kocurvik/robust_self_calibration,https://arxiv.org/abs/2311.16304,,2311.16304.pdf,Robust Self-calibration of Focal Lengths from the Fundamental Matrix,"The problem of self-calibration of two cameras from a given fundamental +matrix is one of the basic problems in geometric computer vision. Under the +assumption of known principal points and square pixels, the well-known Bougnoux +formula offers a means to compute the two unknown focal lengths. However, in +many practical situations, the formula yields inaccurate results due to +commonly occurring singularities. Moreover, the estimates are sensitive to +noise in the computed fundamental matrix and to the assumed positions of the +principal points. In this paper, we therefore propose an efficient and robust +iterative method to estimate the focal lengths along with the principal points +of the cameras given a fundamental matrix and priors for the estimated camera +parameters. In addition, we study a computationally efficient check of models +generated within RANSAC that improves the accuracy of the estimated models +while reducing the total computational time. 
Extensive experiments on real and +synthetic data show that our iterative method brings significant improvements +in terms of the accuracy of the estimated focal lengths over the Bougnoux +formula and other state-of-the-art methods, even when relying on inaccurate +priors.",cs.CV,['cs.CV'] +GauHuman: Articulated Gaussian Splatting from Monocular Human Videos,Shoukang Hu · Tao Hu · Ziwei Liu, ,,https://paperswithcode.com/paper/gauhuman-articulated-gaussian-splatting-from,,,,,nan +Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation,Hyunwoo Ryu · Jiwoo Kim · Hyunseok An · Junwoo Chang · Joohwan Seo · Taehan Kim · Yubin Kim · Chaewon Hwang · Jongeun Choi · Roberto Horowitz,https://sites.google.com/view/diffusion-edfs,https://arxiv.org/abs/2309.02685,,2309.02685.pdf,Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation,"Diffusion generative modeling has become a promising approach for learning +robotic manipulation tasks from stochastic human demonstrations. In this paper, +we present Diffusion-EDFs, a novel SE(3)-equivariant diffusion-based approach +for visual robotic manipulation tasks. We show that our proposed method +achieves remarkable data efficiency, requiring only 5 to 10 human +demonstrations for effective end-to-end training in less than an hour. +Furthermore, our benchmark experiments demonstrate that our approach has +superior generalizability and robustness compared to state-of-the-art methods. +Lastly, we validate our methods with real hardware experiments. Project +Website: https://sites.google.com/view/diffusion-edfs/home",cs.RO,"['cs.RO', 'cs.AI', 'cs.LG']" +DREAM: Diffusion Rectification and Estimation-Adaptive Models,Jinxin Zhou · Tianyu Ding · Tianyi Chen · Jiachen Jiang · Ilya Zharkov · Zhihui Zhu · Luming Liang, ,https://arxiv.org/abs/2312.00210,,2312.00210.pdf,DREAM: Diffusion Rectification and Estimation-Adaptive Models,"We present DREAM, a novel training framework representing Diffusion +Rectification and Estimation Adaptive Models, requiring minimal code changes +(just three lines) yet significantly enhancing the alignment of training with +sampling in diffusion models. DREAM features two components: diffusion +rectification, which adjusts training to reflect the sampling process, and +estimation adaptation, which balances perception against distortion. When +applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff +between minimizing distortion and preserving high image quality. Experiments +demonstrate DREAM's superiority over standard diffusion-based SR methods, +showing a $2$ to $3\times $ faster training convergence and a $10$ to +$20\times$ reduction in sampling steps to achieve comparable results. We hope +DREAM will inspire a rethinking of diffusion model training paradigms.",cs.CV,"['cs.CV', 'cs.AI']" +Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers,Hongjie Wang · Bhishma Dedhia · Niraj Jha,https://jha-lab.github.io/zerotprune/,https://ar5iv.labs.arxiv.org/html/2305.17328,,2305.17328.pdf,Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers,"Deployment of Transformer models on edge devices is becoming increasingly +challenging due to the exponentially growing inference cost that scales +quadratically with the number of tokens in the input sequence. 
Token pruning is +an emerging solution to address this challenge due to its ease of deployment on +various Transformer backbones. However, most token pruning methods require +computationally expensive fine-tuning, which is undesirable in many edge +deployment cases. In this work, we propose Zero-TPrune, the first zero-shot +method that considers both the importance and similarity of tokens in +performing token pruning. It leverages the attention graph of pre-trained +Transformer models to produce an importance distribution for tokens via our +proposed Weighted Page Rank (WPR) algorithm. This distribution further guides +token partitioning for efficient similarity-based pruning. Due to the +elimination of the fine-tuning overhead, Zero-TPrune can prune large models at +negligible computational cost, switch between different pruning configurations +at no computational cost, and perform hyperparameter tuning efficiently. We +evaluate the performance of Zero-TPrune on vision tasks by applying it to +various vision Transformer backbones and testing them on ImageNet. Without any +fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves +its throughput by 45.3% with only 0.4% accuracy loss. Compared with +state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only +eliminates the need for fine-tuning after pruning but also does so with only +0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning +methods, Zero-TPrune reduces accuracy loss by up to 49% with similar FLOPs +budgets. Project webpage: https://jha-lab.github.io/zerotprune.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV']" +FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,Bowen Wen · Wei Yang · Jan Kautz · Stan Birchfield, ,https://arxiv.org/abs/2312.08344,,2312.08344.pdf,FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,"We present FoundationPose, a unified foundation model for 6D object pose +estimation and tracking, supporting both model-based and model-free setups. Our +approach can be instantly applied at test-time to a novel object without +fine-tuning, as long as its CAD model is given, or a small number of reference +images are captured. We bridge the gap between these two setups with a neural +implicit representation that allows for effective novel view synthesis, keeping +the downstream pose estimation modules invariant under the same unified +framework. Strong generalizability is achieved via large-scale synthetic +training, aided by a large language model (LLM), a novel transformer-based +architecture, and contrastive learning formulation. Extensive evaluation on +multiple public datasets involving challenging scenarios and objects indicate +our unified approach outperforms existing methods specialized for each task by +a large margin. In addition, it even achieves comparable results to +instance-level methods despite the reduced assumptions. Project page: +https://nvlabs.github.io/FoundationPose/",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Improving Bird’s Eye View Semantic Segmentation by Task Decomposition,Tianhao Zhao · Yongcan Chen · Yu Wu · Tianyang Liu · Bo Du · Peilun Xiao · shi qiu · Hongda Yang · Guozhen Li · yi yang · Yutian Lin, ,https://arxiv.org/abs/2404.01925v1,,2404.01925v1.pdf,Improving Bird's Eye View Semantic Segmentation by Task Decomposition,"Semantic segmentation in bird's eye view (BEV) plays a crucial role in +autonomous driving. 
Previous methods usually follow an end-to-end pipeline, +directly predicting the BEV segmentation map from monocular RGB inputs. +However, the challenge arises when the RGB inputs and BEV targets from distinct +perspectives, making the direct point-to-point predicting hard to optimize. In +this paper, we decompose the original BEV segmentation task into two stages, +namely BEV map reconstruction and RGB-BEV feature alignment. In the first +stage, we train a BEV autoencoder to reconstruct the BEV segmentation maps +given corrupted noisy latent representation, which urges the decoder to learn +fundamental knowledge of typical BEV patterns. The second stage involves +mapping RGB input images into the BEV latent space of the first stage, directly +optimizing the correlations between the two views at the feature level. Our +approach simplifies the complexity of combining perception and generation into +distinct steps, equipping the model to handle intricate and challenging scenes +effectively. Besides, we propose to transform the BEV segmentation map from the +Cartesian to the polar coordinate system to establish the column-wise +correspondence between RGB images and BEV maps. Moreover, our method requires +neither multi-scale features nor camera intrinsic parameters for depth +estimation and saves computational overhead. Extensive experiments on nuScenes +and Argoverse show the effectiveness and efficiency of our method. Code is +available at https://github.com/happytianhao/TaDe.",cs.CV,"['cs.CV', 'cs.AI']" +Optimal Transport Aggregation for Visual Place Recognition,Sergio Izquierdo · Javier Civera,https://serizba.github.io/salad.html,https://arxiv.org/abs/2311.15937,,2311.15937.pdf,Optimal Transport Aggregation for Visual Place Recognition,"The task of Visual Place Recognition (VPR) aims to match a query image +against references from an extensive database of images from different places, +relying solely on visual cues. State-of-the-art pipelines focus on the +aggregation of features extracted from a deep backbone, in order to form a +global descriptor for each image. In this context, we introduce SALAD (Sinkhorn +Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's +soft-assignment of local features to clusters as an optimal transport problem. +In SALAD, we consider both feature-to-cluster and cluster-to-feature relations +and we also introduce a 'dustbin' cluster, designed to selectively discard +features deemed non-informative, enhancing the overall descriptor quality. +Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides +enhanced description power for the local features, and dramatically reduces the +required training time. As a result, our single-stage method not only surpasses +single-stage baselines in public VPR datasets, but also surpasses two-stage +methods that add a re-ranking with significantly higher cost. 
Code and models +are available at https://github.com/serizba/salad.",cs.CV,['cs.CV'] +DAP: A Dynamic Adversarial Patch for Evading Person Detectors,Amira Guesmi · Ruitian Ding · Muhammad Abdullah Hanif · Ihsen Alouani · Muhammad Shafique, ,,https://dblp.org/rec/journals/corr/abs-2305-11618,,,,,nan +UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures,Mingyuan Zhou · Rakib Hyder · Ziwei Xuan · Guo-Jun Qi,https://usrc-sea.github.io/UltrAvatar/,https://arxiv.org/abs/2401.11078,,2401.11078.pdf,UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures,"Recent advances in 3D avatar generation have gained significant attentions. +These breakthroughs aim to produce more realistic animatable avatars, narrowing +the gap between virtual and real-world experiences. Most of existing works +employ Score Distillation Sampling (SDS) loss, combined with a differentiable +renderer and text condition, to guide a diffusion model in generating 3D +avatars. However, SDS often generates oversmoothed results with few facial +details, thereby lacking the diversity compared with ancestral sampling. On the +other hand, other works generate 3D avatar from a single image, where the +challenges of unwanted lighting effects, perspective views, and inferior image +quality make them difficult to reliably reconstruct the 3D face meshes with the +aligned complete textures. In this paper, we propose a novel 3D avatar +generation approach termed UltrAvatar with enhanced fidelity of geometry, and +superior quality of physically based rendering (PBR) textures without unwanted +lighting. To this end, the proposed approach presents a diffuse color +extraction model and an authenticity guided texture diffusion model. The former +removes the unwanted lighting effects to reveal true diffuse colors so that the +generated avatars can be rendered under various lighting conditions. The latter +follows two gradient-based guidances for generating PBR textures to render +diverse face-identity features and details better aligning with 3D mesh +geometry. We demonstrate the effectiveness and robustness of the proposed +method, outperforming the state-of-the-art methods by a large margin in the +experiments.",cs.CV,['cs.CV'] +IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing,Shaofei Wang · Bozidar Antic · Andreas Geiger · Siyu Tang, ,https://arxiv.org/abs/2312.05210,,2312.05210.pdf,IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing,"We present IntrinsicAvatar, a novel approach to recovering the intrinsic +properties of clothed human avatars including geometry, albedo, material, and +environment lighting from only monocular videos. Recent advancements in +human-based neural rendering have enabled high-quality geometry and appearance +reconstruction of clothed humans from just monocular videos. However, these +methods bake intrinsic properties such as albedo, material, and environment +lighting into a single entangled neural representation. On the other hand, only +a handful of works tackle the problem of estimating geometry and disentangled +appearance properties of clothed humans from monocular videos. They usually +achieve limited quality and disentanglement due to approximations of secondary +shading effects via learned MLPs. In this work, we propose to model secondary +shading effects explicitly via Monte-Carlo ray tracing. 
We model the rendering +process of clothed humans as a volumetric scattering process, and combine ray +tracing with body articulation. Our approach can recover high-quality geometry, +albedo, material, and lighting properties of clothed humans from a single +monocular video, without requiring supervised pre-training using ground truth +materials. Furthermore, since we explicitly model the volumetric scattering +process and ray tracing, our model naturally generalizes to novel poses, +enabling animation of the reconstructed avatar in novel lighting conditions.",cs.CV,['cs.CV'] +A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark,Jakub Paplham · Vojtech Franc, ,https://arxiv.org/abs/2307.04570v2,,2307.04570v2.pdf,A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark,"Comparing different age estimation methods poses a challenge due to the +unreliability of published results stemming from inconsistencies in the +benchmarking process. Previous studies have reported continuous performance +improvements over the past decade using specialized methods; however, our +findings challenge these claims. This paper identifies two trivial, yet +persistent issues with the currently used evaluation protocol and describes how +to resolve them. We describe our evaluation protocol in detail and provide +specific examples of how the protocol should be used. We utilize the protocol +to offer an extensive comparative analysis for state-of-the-art facial age +estimation methods. Surprisingly, we find that the performance differences +between the methods are negligible compared to the effect of other factors, +such as facial alignment, facial coverage, image resolution, model +architecture, or the amount of data used for pretraining. We use the gained +insights to propose using FaRL as the backbone model and demonstrate its +efficiency. The results emphasize the importance of consistent data +preprocessing practices for reliable and meaningful comparisons. We make our +source code public at +https://github.com/paplhjak/Facial-Age-Estimation-Benchmark.",cs.CV,"['cs.CV', 'cs.LG']" +REACTO: Reconstructing Articulated Objects from a Single Video,Chaoyue Song · Jiacheng Wei · Chuan-Sheng Foo · Guosheng Lin · Fayao Liu,https://chaoyuesong.github.io/REACTO/,https://arxiv.org/abs/2404.11151,,2404.11151.pdf,REACTO: Reconstructing Articulated Objects from a Single Video,"In this paper, we address the challenge of reconstructing general articulated +3D objects from a single video. Existing works employing dynamic neural +radiance fields have advanced the modeling of articulated objects like humans +and animals from videos, but face challenges with piece-wise rigid general +articulated objects due to limitations in their deformation models. To tackle +this, we propose Quasi-Rigid Blend Skinning, a novel deformation model that +enhances the rigidity of each part while maintaining flexible deformation of +the joints. Our primary insight combines three distinct approaches: 1) an +enhanced bone rigging system for improved component modeling, 2) the use of +quasi-sparse skinning weights to boost part rigidity and reconstruction +fidelity, and 3) the application of geodesic point assignment for precise +motion and seamless deformation. 
Our method outperforms previous works in +producing higher-fidelity 3D reconstructions of general articulated objects, as +demonstrated on both real and synthetic datasets. Project page: +https://chaoyuesong.github.io/REACTO.",cs.CV,['cs.CV'] +DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation,Junming Chen · Yunfei Liu · Jianan Wang · Ailing Zeng · Yu Li · Qifeng Chen,https://jeremycjm.github.io/proj/DiffSHEG/,https://arxiv.org/abs/2401.04747,,2401.04747.pdf,DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation,"We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D +Expression and Gesture generation with arbitrary length. While previous works +focused on co-speech gesture or expression generation individually, the joint +generation of synchronized expressions and gestures remains barely explored. To +address this, our diffusion-based co-speech motion generation transformer +enables uni-directional information flow from expression to gesture, +facilitating improved matching of joint expression-gesture distributions. +Furthermore, we introduce an outpainting-based sampling strategy for arbitrary +long sequence generation in diffusion models, offering flexibility and +computational efficiency. Our method provides a practical solution that +produces high-quality synchronized expression and gesture generation driven by +speech. Evaluated on two public datasets, our approach achieves +state-of-the-art performance both quantitatively and qualitatively. +Additionally, a user study confirms the superiority of DiffSHEG over prior +approaches. By enabling the real-time generation of expressive and synchronized +motions, DiffSHEG showcases its potential for various applications in the +development of digital humans and embodied agents.",cs.SD,"['cs.SD', 'cs.AI', 'cs.CV', 'cs.GR', 'eess.AS']" +Building Vision-Language Models on Solid Foundations with Masked Distillation,Sepehr Sameni · Kushal Kafle · Hao Tan · Simon Jenni, ,https://arxiv.org/abs/2311.03149,,2311.03149.pdf,Asymmetric Masked Distillation for Pre-Training Small Foundation Models,"Self-supervised foundation models have shown great potential in computer +vision thanks to the pre-training paradigm of masked autoencoding. Scale is a +primary factor influencing the performance of these foundation models. However, +these large foundation models often result in high computational cost. This +paper focuses on pre-training relatively small vision transformer models that +could be efficiently adapted to downstream tasks. Specifically, taking +inspiration from knowledge distillation in model compression, we propose a new +asymmetric masked distillation (AMD) framework for pre-training relatively +small models with autoencoding. The core of AMD is to devise an asymmetric +masking strategy, where the teacher model is enabled to see more context +information with a lower masking ratio, while the student model is still +equipped with a high masking ratio. We design customized multi-layer feature +alignment between the teacher encoder and student encoder to regularize the +pre-training of student MAE. To demonstrate the effectiveness and versatility +of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively +small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the +ViT-B model. 
And AMD achieves 73.3% classification accuracy using the ViT-B +model on the Something-in-Something V2 dataset, a 3.7% improvement over the +original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to +downstream tasks and obtain consistent performance improvement over the +original masked autoencoding. The code and models are available at +https://github.com/MCG-NJU/AMD.",cs.CV,['cs.CV'] +Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance,Chamin Hewa Koneputugodage · Yizhak Ben-Shabat · Dylan Campbell · Stephen Gould, ,http://export.arxiv.org/abs/2310.07997,,2310.07997.pdf,PG-NeuS: Robust and Efficient Point Guidance for Multi-View Neural Surface Reconstruction,"Recently, learning multi-view neural surface reconstruction with the +supervision of point clouds or depth maps has been a promising way. However, +due to the underutilization of prior information, current methods still +struggle with the challenges of limited accuracy and excessive time complexity. +In addition, prior data perturbation is also an important but rarely considered +issue. To address these challenges, we propose a novel point-guided method +named PG-NeuS, which achieves accurate and efficient reconstruction while +robustly coping with point noise. Specifically, aleatoric uncertainty of the +point cloud is modeled to capture the distribution of noise, leading to noise +robustness. Furthermore, a Neural Projection module connecting points and +images is proposed to add geometric constraints to implicit surface, achieving +precise point guidance. To better compensate for geometric bias between volume +rendering and point modeling, high-fidelity points are filtered into a Bias +Network to further improve details representation. Benefiting from the +effective point guidance, even with a lightweight network, the proposed PG-NeuS +achieves fast convergence with an impressive 11x speedup compared to NeuS. +Extensive experiments show that our method yields high-quality surfaces with +high efficiency, especially for fine-grained details and smooth regions, +outperforming the state-of-the-art methods. Moreover, it exhibits strong +robustness to noisy data and sparse data.",cs.CV,"['cs.CV', 'cs.AI']" +Self-correcting LLM-controlled Diffusion,Tsung-Han Wu · Long Lian · Joseph Gonzalez · Boyi Li · Trevor Darrell,https://self-correcting-llm-diffusion.github.io/,https://arxiv.org/abs/2311.16090,,2311.16090.pdf,Self-correcting LLM-controlled Diffusion Models,"Text-to-image generation has witnessed significant progress with the advent +of diffusion models. Despite the ability to generate photorealistic images, +current text-to-image diffusion models still often struggle to accurately +interpret and follow complex input text prompts. In contrast to existing models +that aim to generate images only with their best effort, we introduce +Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that +generates an image from the input prompt, assesses its alignment with the +prompt, and performs self-corrections on the inaccuracies in the generated +image. Steered by an LLM controller, SLD turns text-to-image generation into an +iterative closed-loop process, ensuring correctness in the resulting image. SLD +is not only training-free but can also be seamlessly integrated with diffusion +models behind API access, such as DALL-E 3, to further boost the performance of +state-of-the-art diffusion models. 
Experimental results show that our approach +can rectify a majority of incorrect generations, particularly in generative +numeracy, attribute binding, and spatial relationships. Furthermore, by simply +adjusting the instructions to the LLM, SLD can perform image editing tasks, +bridging the gap between text-to-image generation and image editing pipelines. +We will make our code available for future research and applications.",cs.CV,['cs.CV'] +Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval,Yucheng Suo · Fan Ma · Linchao Zhu · Yi Yang, ,https://arxiv.org/abs/2403.16005,,2403.16005.pdf,Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval,"We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to +retrieve the target image given a reference image and a description without +training on the triplet datasets. Previous works generate pseudo-word tokens by +projecting the reference image features to the text embedding space. However, +they focus on the global visual representation, ignoring the representation of +detailed attributes, e.g., color, object number and layout. To address this +challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image +retrieval framework (KEDs). KEDs implicitly models the attributes of the +reference images by incorporating a database. The database enriches the +pseudo-word tokens by providing relevant images and captions, emphasizing +shared attribute information in various aspects. In this way, KEDs recognizes +the reference image from diverse perspectives. Moreover, KEDs adopts an extra +stream that aligns pseudo-word tokens with textual concepts, leveraging +pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated +in this stream are explicitly aligned with fine-grained semantics in the text +embedding space. Extensive experiments on widely used benchmarks, i.e. +ImageNet-R, COCO object, Fashion-IQ and CIRR, show that KEDs outperforms +previous zero-shot composed image retrieval methods.",cs.CV,['cs.CV'] +MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training,Pavan Kumar Anasosalu Vasu · Hadi Pouransari · Fartash Faghri · Raviteja Vemulapalli · Oncel Tuzel,https://github.com/apple/ml-mobileclip,https://arxiv.org/abs/2311.17049,,2311.17049.pdf,MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training,"Contrastive pretraining of image-text foundation models, such as CLIP, +demonstrated excellent zero-shot performance and improved robustness on a wide +range of downstream tasks. However, these models utilize large +transformer-based encoders with significant memory and latency overhead which +pose challenges for deployment on mobile devices. In this work, we introduce +MobileCLIP -- a new family of efficient image-text models optimized for runtime +performance along with a novel and efficient training approach, namely +multi-modal reinforced training. The proposed training approach leverages +knowledge transfer from an image captioning model and an ensemble of strong +CLIP encoders to improve the accuracy of efficient models. Our approach avoids +train-time compute overhead by storing the additional knowledge in a reinforced +dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for +zero-shot classification and retrieval tasks on several datasets. Our +MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to +previous best CLIP model based on ViT-B/16. 
We further demonstrate the +effectiveness of our multi-modal reinforced training by training a CLIP model +based on ViT-B/16 image backbone and achieving +2.9% average performance +improvement on 38 evaluation benchmarks compared to the previous best. +Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ +improved learning efficiency when compared with non-reinforced CLIP training. +Code and models are available at https://github.com/apple/ml-mobileclip .",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Locally Adaptive Neural 3D Morphable Models,Michail Tarasiou · Rolandos Alexandros Potamias · Eimear O' Sullivan · Stylianos Ploumpis · Stefanos Zafeiriou, ,https://arxiv.org/abs/2401.02937,,2401.02937.pdf,Locally Adaptive Neural 3D Morphable Models,"We present the Locally Adaptive Morphable Model (LAMM), a highly flexible +Auto-Encoder (AE) framework for learning to generate and manipulate 3D meshes. +We train our architecture following a simple self-supervised training scheme in +which input displacements over a set of sparse control vertices are used to +overwrite the encoded geometry in order to transform one training sample into +another. During inference, our model produces a dense output that adheres +locally to the specified sparse geometry while maintaining the overall +appearance of the encoded object. This approach results in state-of-the-art +performance in both disentangling manipulated geometry and 3D mesh +reconstruction. To the best of our knowledge LAMM is the first end-to-end +framework that enables direct local control of 3D vertex geometry in a single +forward pass. A very efficient computational graph allows our network to train +with only a fraction of the memory required by previous methods and run faster +during inference, generating 12k vertex meshes at $>$60fps on a single CPU +thread. We further leverage local geometry control as a primitive for higher +level editing operations and present a set of derivative capabilities such as +swapping and sampling object parts. Code and pretrained models can be found at +https://github.com/michaeltrs/LAMM.",cs.CV,['cs.CV'] +Finsler-Laplace-Beltrami Operators with Application to Shape Analysis,Simon Weber · Thomas Dagès · Maolin Gao · Daniel Cremers, ,https://arxiv.org/abs/2404.03999,,2404.03999.pdf,Finsler-Laplace-Beltrami Operators with Application to Shape Analysis,"The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped +with a Riemannian metric. It is often called the Swiss army knife of geometry +processing as it allows to capture intrinsic shape information and gives rise +to heat diffusion, geodesic distances, and a multitude of shape descriptors. It +also plays a central role in geometric deep learning. In this work, we explore +Finsler manifolds as a generalization of Riemannian manifolds. We revisit the +Finsler heat equation and derive a Finsler heat kernel and a +Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified +anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we +demonstrate that the proposed FLBO is a valuable alternative to the traditional +Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence +estimation. 
We hope that the proposed Finsler heat kernel and the FLBO will +inspire further exploration of Finsler geometry in the computer vision +community.",cs.CV,['cs.CV'] +InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,Zhe Chen · Jiannan Wu · Wenhai Wang · Weijie Su · Guo Chen · Sen Xing · Zhong Muyan · Qing-Long Zhang · Xizhou Zhu · Lewei Lu · Bin Li · Ping Luo · Tong Lu · Yu Qiao · Jifeng Dai, ,https://arxiv.org/abs/2312.14238,,2312.14238.pdf,InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,"The exponential growth of large language models (LLMs) has opened up numerous +possibilities for multimodal AGI systems. However, the progress in vision and +vision-language foundation models, which are also critical elements of +multi-modal AGI, has not kept pace with LLMs. In this work, we design a +large-scale vision-language foundation model (InternVL), which scales up the +vision foundation model to 6 billion parameters and progressively aligns it +with the LLM, using web-scale image-text data from various sources. This model +can be broadly applied to and achieve state-of-the-art performance on 32 +generic visual-linguistic benchmarks including visual perception tasks such as +image-level or pixel-level recognition, vision-language tasks such as zero-shot +image/video classification, zero-shot image/video-text retrieval, and link with +LLMs to create multi-modal dialogue systems. It has powerful visual +capabilities and can be a good alternative to the ViT-22B. We hope that our +research could contribute to the development of multi-modal large models. Code +and models are available at https://github.com/OpenGVLab/InternVL.",cs.CV,['cs.CV'] +LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry,Weirong Chen · Le Chen · Rui Wang · Marc Pollefeys,https://chiaki530.github.io/projects/leapvo/,https://arxiv.org/abs/2401.01887v1,,2401.01887v1.pdf,LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry,"Visual odometry estimates the motion of a moving camera based on visual +input. Existing methods, mostly focusing on two-view point tracking, often +ignore the rich temporal context in the image sequence, thereby overlooking the +global motion patterns and providing no assessment of the full trajectory +reliability. These shortcomings hinder performance in scenarios with occlusion, +dynamic objects, and low-texture areas. To address these challenges, we present +the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively +combines visual, inter-track, and temporal cues with mindfully selected anchors +for dynamic track estimation. Moreover, LEAP's temporal probabilistic +formulation integrates distribution updates into a learnable iterative +refinement module to reason about point-wise uncertainty. Based on these +traits, we develop LEAP-VO, a robust visual odometry system adept at handling +occlusions and dynamic scenes. Our mindful integration showcases a novel +practice by employing long-term point tracking as the front-end. 
Extensive +experiments demonstrate that the proposed pipeline significantly outperforms +existing baselines across various visual odometry benchmarks.",cs.CV,['cs.CV'] +MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding,Bo He · Hengduo Li · Young Kyun Jang · Menglin Jia · Xuefei Cao · Ashish Shah · Abhinav Shrivastava · Ser-Nam Lim, ,https://arxiv.org/html/2404.05726v2,,2404.05726v2.pdf,MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding,"With the success of large language models (LLMs), integrating the vision +model into LLMs to build vision-language foundation models has gained much more +interest recently. However, existing LLM-based large multimodal models (e.g., +Video-LLaMA, VideoChat) can only take in a limited number of frames for short +video understanding. In this study, we mainly focus on designing an efficient +and effective model for long-term video understanding. Instead of trying to +process more frames simultaneously like most existing work, we propose to +process videos in an online manner and store past video information in a memory +bank. This allows our model to reference historical video content for long-term +analysis without exceeding LLMs' context length constraints or GPU memory +limits. Our memory bank can be seamlessly integrated into current multimodal +LLMs in an off-the-shelf manner. We conduct extensive experiments on various +video understanding tasks, such as long-video understanding, video question +answering, and video captioning, and our model can achieve state-of-the-art +performances across multiple datasets. Code available at +https://boheumd.github.io/MA-LMM/.",cs.CV,['cs.CV'] +Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption,Buzhen Huang · Chen Li · Chongyang Xu · Liang Pan · Yangang Wang · Gim Hee Lee, ,https://arxiv.org/abs/2404.11291,,2404.11291.pdf,Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption,"Existing multi-person human reconstruction approaches mainly focus on +recovering accurate poses or avoiding penetration, but overlook the modeling of +close interactions. In this work, we tackle the task of reconstructing closely +interactive humans from a monocular video. The main challenge of this task +comes from insufficient visual information caused by depth ambiguity and severe +inter-person occlusion. In view of this, we propose to leverage knowledge from +proxemic behavior and physics to compensate the lack of visual information. +This is based on the observation that human interaction has specific patterns +following the social proxemics. Specifically, we first design a latent +representation based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to +model human interaction. A proxemics and physics guided diffusion model is then +introduced to denoise the initial distribution. We design the diffusion model +as dual branch with each branch representing one individual such that the +interaction can be modeled via cross attention. With the learned priors of +VQ-VAE and physical constraint as the additional information, our proposed +approach is capable of estimating accurate poses that are also proxemics and +physics plausible. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate +that our method outperforms existing approaches. 
The code is available at +\url{https://github.com/boycehbz/HumanInteraction}.",cs.CV,['cs.CV'] +The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement,Gabriele Trivigno · Carlo Masone · Barbara Caputo · Torsten Sattler, ,https://arxiv.org/abs/2404.10438v1,,2404.10438v1.pdf,The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement,"Pose refinement is an interesting and practically relevant research +direction. Pose refinement can be used to (1) obtain a more accurate pose +estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, +i.e., to provide a better starting point to a more expensive pose estimator, +(3) as post-processing of a more accurate localizer. Existing approaches focus +on learning features / scene representations for the pose refinement task. This +involves training an implicit scene representation or learning features while +optimizing a camera pose-based loss. A natural question is whether training +specific features / representations is truly necessary or whether similar +results can be already achieved with more generic features. In this work, we +present a simple approach that combines pre-trained features with a particle +filter and a renderable representation of the scene. Despite its simplicity, it +achieves state-of-the-art results, demonstrating that one can easily build a +pose refiner without the need for specific training. The code is at +https://github.com/ga1i13o/mcloc_poseref",cs.CV,['cs.CV'] +GDA: Generalized Diffusion for Robust Test-time Adaptation,Yun-Yun Tsai · Fu-Chen Chen · Albert Chen · Junfeng Yang · Che-Chun Su · Min Sun · Cheng-Hao Kuo, ,https://arxiv.org/abs/2404.00095,,2404.00095.pdf,GDA: Generalized Diffusion for Robust Test-time Adaptation,"Machine learning models struggle with generalization when encountering +out-of-distribution (OOD) samples with unexpected distribution shifts. For +vision tasks, recent studies have shown that test-time adaptation employing +diffusion models can achieve state-of-the-art accuracy improvements on OOD +samples by generating new samples that align with the model's domain without +the need to modify the model's weights. Unfortunately, those studies have +primarily focused on pixel-level corruptions, thereby lacking the +generalization to adapt to a broader range of OOD types. We introduce +Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time +adaptation method robust against diverse OOD types. Specifically, GDA +iteratively guides the diffusion by applying a marginal entropy loss derived +from the model, in conjunction with style and content preservation losses +during the reverse sampling process. In other words, GDA considers the model's +output behavior with the semantic information of the samples as a whole, which +can reduce ambiguity in downstream tasks during the generation process. +Evaluation across various popular model architectures and OOD benchmarks shows +that GDA consistently outperforms prior work on diffusion-driven adaptation. +Notably, it achieves the highest classification accuracy improvements, ranging +from 4.4\% to 5.02\% on ImageNet-C and 2.5\% to 7.4\% on Rendition, Sketch, and +Stylized benchmarks. 
This performance highlights GDA's generalization to a +broader range of OOD benchmarks.",cs.CV,['cs.CV'] +RecDiffusion: Rectangling for Image Stitching with Diffusion Models,Tianhao Zhou · Li Haipeng · Ziyi Wang · Ao Luo · Chenlin Zhang · Jiajun Li · Bing Zeng · Shuaicheng Liu, ,https://arxiv.org/abs/2403.19164,,2403.19164.pdf,RecDiffusion: Rectangling for Image Stitching with Diffusion Models,"Image stitching from different captures often results in non-rectangular +boundaries, which is often considered unappealing. To solve non-rectangular +boundaries, current solutions involve cropping, which discards image content, +inpainting, which can introduce unrelated content, or warping, which can +distort non-linear features and introduce artifacts. To overcome these issues, +we introduce a novel diffusion-based learning framework, \textbf{RecDiffusion}, +for image stitching rectangling. This framework combines Motion Diffusion +Models (MDM) to generate motion fields, effectively transitioning from the +stitched image's irregular borders to a geometrically corrected intermediary. +Followed by Content Diffusion Models (CDM) for image detail refinement. +Notably, our sampling process utilizes a weighted map to identify regions +needing correction during each iteration of CDM. Our RecDiffusion ensures +geometric accuracy and overall visual appeal, surpassing all previous methods +in both quantitative and qualitative measures when evaluated on public +benchmarks. Code is released at https://github.com/lhaippp/RecDiffusion.",cs.CV,['cs.CV'] +"Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts",Qin Liu · Jaemin Cho · Mohit Bansal · Marc Niethammer,https://github.com/uncbiag/SegNext,https://arxiv.org/abs/2404.00741,,2404.00741.pdf,"Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts","The goal of interactive image segmentation is to delineate specific regions +within an image via visual or language prompts. Low-latency and high-quality +interactive segmentation with diverse prompts remain challenging for existing +specialist and generalist models. Specialist models, with their limited prompts +and task-specific designs, experience high latency because the image must be +recomputed every time the prompt is updated, due to the joint encoding of image +and visual prompts. Generalist models, exemplified by the Segment Anything +Model (SAM), have recently excelled in prompt diversity and efficiency, lifting +image segmentation to the foundation model era. However, for high-quality +segmentations, SAM still lags behind state-of-the-art specialist models despite +SAM being trained with x100 more segmentation masks. In this work, we delve +deep into the architectural differences between the two types of models. We +observe that dense representation and fusion of visual prompts are the key +design choices contributing to the high segmentation quality of specialist +models. In light of this, we reintroduce this dense design into the generalist +models, to facilitate the development of generalist models with high +segmentation quality. To densely represent diverse visual prompts, we propose +to use a dense map to capture five types: clicks, boxes, polygons, scribbles, +and masks. Thus, we propose SegNext, a next-generation interactive segmentation +approach offering low latency, high quality, and diverse prompt support. 
Our +method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, +both quantitatively and qualitatively.",cs.CV,['cs.CV'] +Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior,Fangfu Liu · Diankun Wu · Yi Wei · Yongming Rao · Yueqi Duan,https://liuff19.github.io/Sherpa3D/,https://arxiv.org/abs/2312.06655,,2312.06655.pdf,Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior,"Recently, 3D content creation from text prompts has demonstrated remarkable +progress by utilizing 2D and 3D diffusion models. While 3D diffusion models +ensure great multi-view consistency, their ability to generate high-quality and +diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion +models find a distillation approach that achieves excellent generalization and +rich details without any 3D data. However, 2D lifting methods suffer from +inherent view-agnostic ambiguity thereby leading to serious multi-face Janus +issues, where text prompts fail to provide sufficient guidance to learn +coherent 3D results. Instead of retraining a costly viewpoint-aware model, we +study how to fully exploit easily accessible coarse 3D knowledge to enhance the +prompts and guide 2D lifting optimization for refinement. In this paper, we +propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, +generalizability, and geometric consistency simultaneously. Specifically, we +design a pair of guiding strategies derived from the coarse 3D prior generated +by the 3D diffusion model: a structural guidance for geometric fidelity and a +semantic guidance for 3D coherence. Employing the two types of guidance, the 2D +diffusion model enriches the 3D content with diversified and high-quality +results. Extensive experiments show the superiority of our Sherpa3D over the +state-of-the-art text-to-3D methods in terms of quality and 3D consistency.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Identifying Important Group of Pixels using Interactions,Kosuke Sumiyasu · Kazuhiko Kawamoto · Hiroshi Kera, ,https://arxiv.org/abs/2401.03785,,2401.03785.pdf,Identifying Important Group of Pixels using Interactions,"To better understand the behavior of image classifiers, it is useful to +visualize the contribution of individual pixels to the model prediction. In +this study, we propose a method, MoXI ($\textbf{Mo}$del e$\textbf{X}$planation +by $\textbf{I}$nteractions), that efficiently and accurately identifies a group +of pixels with high prediction confidence. The proposed method employs +game-theoretic concepts, Shapley values and interactions, taking into account +the effects of individual pixels and the cooperative influence of pixels on +model confidence. Theoretical analysis and experiments demonstrate that our +method better identifies the pixels that are highly contributing to the model +outputs than widely-used visualization by Grad-CAM, Attention rollout, and +Shapley value. While prior studies have suffered from the exponential +computational cost in the computation of Shapley value and interactions, we +show that this can be reduced to quadratic cost for our task. 
The code is +available at https://github.com/KosukeSumiyasu/MoXI.",cs.CV,"['cs.CV', 'cs.LG']" +DualAD: Disentangling the Dynamic and Static World for End-to-End Driving,Simon Doll · Niklas Hanselmann · Lukas Schneider · Richard Schulz · Marius Cordts · Markus Enzweiler · Hendrik Lensch, ,https://arxiv.org/html/2306.16927v2,,2306.16927v2.pdf,End-to-end Autonomous Driving: Challenges and Frontiers,"The autonomous driving community has witnessed a rapid growth in approaches +that embrace an end-to-end algorithm framework, utilizing raw sensor input to +generate vehicle motion plans, instead of concentrating on individual tasks +such as detection and motion prediction. End-to-end systems, in comparison to +modular pipelines, benefit from joint feature optimization for perception and +planning. This field has flourished due to the availability of large-scale +datasets, closed-loop evaluation, and the increasing need for autonomous +driving algorithms to perform effectively in challenging scenarios. In this +survey, we provide a comprehensive analysis of more than 270 papers, covering +the motivation, roadmap, methodology, challenges, and future trends in +end-to-end autonomous driving. We delve into several critical challenges, +including multi-modality, interpretability, causal confusion, robustness, and +world models, amongst others. Additionally, we discuss current advancements in +foundation models and visual pre-training, as well as how to incorporate these +techniques within the end-to-end driving framework. we maintain an active +repository that contains up-to-date literature and open-source projects at +https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing,Li Maomao · Yu Li · Tianyu Yang · Yunfei Liu · Dongxu Yue · Zhihui Lin · Dong Xu, ,https://arxiv.org/abs/2312.05856,,2312.05856.pdf,A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing,"This paper presents a video inversion approach for zero-shot video editing, +which models the input video with low-rank representation during the inversion +process. The existing video editing methods usually apply the typical 2D DDIM +inversion or naive spatial-temporal DDIM inversion before editing, which +leverages time-varying representation for each frame to derive noisy latent. +Unlike most existing approaches, we propose a Spatial-Temporal +Expectation-Maximization (STEM) inversion, which formulates the dense video +feature under an expectation-maximization manner and iteratively estimates a +more compact basis set to represent the whole video. Each frame applies the +fixed and global representation for inversion, which is more friendly for +temporal consistency during reconstruction and editing. Extensive qualitative +and quantitative experiments demonstrate that our STEM inversion can achieve +consistent improvement on two state-of-the-art video editing methods. Project +page: https://stem-inv.github.io/page/.",cs.CV,['cs.CV'] +"Time-, Memory- and Parameter-Efficient Visual Adaptation",Otniel-Bogdan Mercea · Alexey Gritsenko · Cordelia Schmid · Anurag Arnab, ,https://arxiv.org/abs/2402.02887,,2402.02887.pdf,"Time-, Memory- and Parameter-Efficient Visual Adaptation","As foundation models become more popular, there is a growing need to +efficiently finetune them for downstream tasks. 
Although numerous adaptation +methods have been proposed, they are designed to be efficient only in terms of +how many parameters are trained. They, however, typically still require +backpropagating gradients throughout the model, meaning that their +training-time and -memory cost does not reduce as significantly. We propose an +adaptation method which does not backpropagate gradients through the backbone. +We achieve this by designing a lightweight network in parallel that operates on +features from the frozen, pretrained backbone. As a result, our method is +efficient not only in terms of parameters, but also in training-time and memory +usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on +the popular VTAB benchmark, and we further show how we outperform prior works +with respect to training-time and -memory usage too. We further demonstrate the +training efficiency and scalability of our method by adapting a vision +transformer backbone of 4 billion parameters for the computationally demanding +task of video classification, without any intricate model parallelism. Here, we +outperform a prior adaptor-based method which could only scale to a 1 billion +parameter backbone, or fully-finetuning a smaller backbone, with the same GPU +and less training time.",cs.CV,"['cs.CV', 'cs.LG']" +WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification,Satish Kumar · Bowen Zhang · Chandrakanth Gudavalli · Connor Levenson · Lacey Hughey · Jared Stabach · Irene Amoke · Gordon Ojwang · Joseph Mukeka · Howard Frederick · Stephen Mwiu · Joseph Ochieng Ogutu · B S Manjunath, ,https://arxiv.org/abs/2311.12956,,2311.12956.pdf,Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection,"In the realm of aerial image analysis, object detection plays a pivotal role, +with significant implications for areas such as remote sensing, urban planning, +and disaster management. This study addresses the inherent challenges in this +domain, notably the detection of small objects, managing densely packed +elements, and accounting for diverse orientations. We present an in-depth +evaluation of an object detection model that integrates the Large Selective +Kernel Network (LSKNet)as its backbone with the DiffusionDet head, utilizing +the iSAID dataset for empirical analysis. Our approach encompasses the +introduction of novel methodologies and extensive ablation studies. These +studies critically assess various aspects such as loss functions, box +regression techniques, and classification strategies to refine the model's +precision in object detection. The paper details the experimental application +of the LSKNet backbone in synergy with the DiffusionDet heads, a combination +tailored to meet the specific challenges in aerial image object detection. The +findings of this research indicate a substantial enhancement in the model's +performance, especially in the accuracy-time tradeoff. The proposed model +achieves a mean average precision (MAP) of approximately 45.7%, which is a +significant improvement, outperforming the RCNN model by 4.7% on the same +dataset. This advancement underscores the effectiveness of the proposed +modifications and sets a new benchmark in aerial image analysis, paving the way +for more accurate and efficient object detection methodologies. 
The code is +publicly available at https://github.com/SashaMatsun/LSKDiffDet",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball,Simon Weber · Barış Zöngür · Nikita Araslanov · Daniel Cremers, ,https://arxiv.org/abs/2404.03778,,2404.03778.pdf,Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball,"Hierarchy is a natural representation of semantic taxonomies, including the +ones routinely used in image segmentation. Indeed, recent work on semantic +segmentation reports improved accuracy from supervised training leveraging +hierarchical label structures. Encouraged by these results, we revisit the +fundamental assumptions behind that work. We postulate and then empirically +verify that the reasons for the observed improvement in segmentation accuracy +may be entirely unrelated to the use of the semantic hierarchy. To demonstrate +this, we design a range of cross-domain experiments with a representative +hierarchical approach. We find that on the new testing domains, a flat +(non-hierarchical) segmentation network, in which the parents are inferred from +the children, has superior segmentation accuracy to the hierarchical approach +across the board. Complementing these findings and inspired by the intrinsic +properties of hyperbolic spaces, we study a more principled approach to +hierarchical segmentation using the Poincar\'e ball model. The hyperbolic +representation largely outperforms the previous (Euclidean) hierarchical +approach as well and is on par with our flat Euclidean baseline in terms of +segmentation accuracy. However, it additionally exhibits surprisingly strong +calibration quality of the parent nodes in the semantic hierarchy, especially +on the more challenging domains. Our combined analysis suggests that the +established practice of hierarchical segmentation may be limited to in-domain +settings, whereas flat classifiers generalize substantially better, especially +if they are modeled in the hyperbolic space.",cs.CV,['cs.CV'] +OTE: Exploring Accurate Scene Text Recognition Using One Token,Jianjun Xu · Yuxin Wang · Hongtao Xie · Yongdong Zhang, ,https://arxiv.org/html/2403.07518v1,,2403.07518v1.pdf,Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss,"Scene text recognition is an important and challenging task in computer +vision. However, most prior works focus on recognizing pre-defined words, while +there are various out-of-vocabulary (OOV) words in real-world applications. + In this paper, we propose a novel open-vocabulary text recognition framework, +Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack +of OOV training data. To solve this problem, we first propose a pseudo label +generation module that leverages character detection and image inpainting to +produce substantial pseudo OOV training data from real-world images. Unlike +previous synthetic data, our pseudo OOV data contains real characters and +backgrounds to simulate real-world applications. Secondly, to reduce noises in +pseudo data, we present a semantic checking mechanism to filter semantically +meaningful data. Thirdly, we introduce a quality-aware margin loss to boost the +training with pseudo data. Our loss includes a margin-based part to enhance the +classification ability, and a quality-aware part to penalize low-quality +samples in both real and pseudo data. 
+ Extensive experiments demonstrate that our approach outperforms the +state-of-the-art on eight datasets and achieves the first rank in the ICDAR2022 +challenge.",cs.CV,['cs.CV'] +Language-Driven Anchors for Zero-Shot Adversarial Robustness,Xiao Li · Wei Zhang · Yining Liu · Zhanhao Hu · Bo Zhang · Xiaolin Hu,https://github.com/LixiaoTHU/LAAT,,https://paperswithcode.com/search?q=author:Xiaolin+Hu&order_by=stars,,,,,nan +DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting,Jer Pelhan · Alan Lukezic · Vitjan Zavrtanik · Matej Kristan, ,https://arxiv.org/abs/2404.16622,,2404.16622.pdf,DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting,"Low-shot counters estimate the number of objects corresponding to a selected +category, based on only few or no exemplars annotated in the image. The current +state-of-the-art estimates the total counts as the sum over the object location +density map, but does not provide individual object locations and sizes, which +are crucial for many applications. This is addressed by detection-based +counters, which, however fall behind in the total count accuracy. Furthermore, +both approaches tend to overestimate the counts in the presence of other object +classes due to many false positives. We propose DAVE, a low-shot counter based +on a detect-and-verify paradigm, that avoids the aforementioned issues by first +generating a high-recall detection set and then verifying the detections to +identify and remove the outliers. This jointly increases the recall and +precision, leading to accurate counts. DAVE outperforms the top density-based +counters by ~20% in the total count MAE, it outperforms the most recent +detection-based counter by ~20% in detection quality and sets a new +state-of-the-art in zero-shot as well as text-prompt-based counting.",cs.CV,['cs.CV'] +SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes,Soubhik Sanyal · Partha Ghosh · Jinlong Yang · Michael J. Black · Justus Thies · Timo Bolkart,https://sculpt.is.tue.mpg.de/,https://arxiv.org/html/2308.10638v2,,2308.10638v2.pdf,SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes,"We present SCULPT, a novel 3D generative model for clothed and textured 3D +meshes of humans. Specifically, we devise a deep neural network that learns to +represent the geometry and appearance distribution of clothed human bodies. +Training such a model is challenging, as datasets of textured 3D meshes for +humans are limited in size and accessibility. Our key observation is that there +exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image +datasets of clothed humans and multiple appearances can be mapped to a single +geometry. To effectively learn from the two data modalities, we propose an +unpaired learning procedure for pose-dependent clothed and textured human +meshes. Specifically, we learn a pose-dependent geometry space from 3D scan +data. We represent this as per vertex displacements w.r.t. the SMPL model. +Next, we train a geometry conditioned texture generator in an unsupervised way +using the 2D image data. We use intermediate activations of the learned +geometry model to condition our texture generator. To alleviate entanglement +between pose and clothing type, and pose and clothing appearance, we condition +both the texture and geometry generators with attribute labels such as clothing +types for the geometry, and clothing colors for the texture generator. 
We +automatically generated these conditioning labels for the 2D images based on +the visual question answering model BLIP and CLIP. We validate our method on +the SCULPT dataset, and compare to state-of-the-art 3D generative models for +clothed human bodies. Our code and data can be found at +https://sculpt.is.tue.mpg.de.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +Brush2Prompt: Contextual Prompt Generator for Object Inpainting,Mang Tik Chiu · Yuqian Zhou · Lingzhi Zhang · Zhe Lin · Connelly Barnes · Sohrab Amirghodsi · Eli Shechtman · Humphrey Shi, ,https://ar5iv.labs.arxiv.org/html/2204.07845,,2204.07845.pdf,Shape-guided Object Inpainting,"Previous works on image inpainting mainly focus on inpainting background or +partially missing objects, while the problem of inpainting an entire missing +object remains unexplored. This work studies a new image inpainting task, i.e. +shape-guided object inpainting. Given an incomplete input image, the goal is to +fill in the hole by generating an object based on the context and implicit +guidance given by the hole shape. Since previous methods for image inpainting +are mainly designed for background inpainting, they are not suitable for this +task. Therefore, we propose a new data preparation method and a novel +Contextual Object Generator (CogNet) for the object inpainting task. On the +data side, we incorporate object priors into training data by using object +instances as holes. The CogNet has a two-stream architecture that combines the +standard bottom-up image completion process with a top-down object generation +process. A predictive class embedding module bridges the two streams by +predicting the class of the missing object from the bottom-up features, from +which a semantic object map is derived as the input of the top-down stream. +Experiments demonstrate that the proposed method can generate realistic objects +that fit the context in terms of both visual appearance and semantic meanings. +Code can be found at the project page +\url{https://zengxianyu.github.io/objpaint}",cs.CV,"['cs.CV', 'cs.MM']" +AV-RIR: Audio-Visual Room Impulse Response Estimation,Anton Ratnarajah · Sreyan Ghosh · Sonal Kumar · Purva Chiniya · Dinesh Manocha,https://anton-jeran.github.io/AVRIR/,https://arxiv.org/abs/2312.00834,,2312.00834.pdf,AV-RIR: Audio-Visual Room Impulse Response Estimation,"Accurate estimation of Room Impulse Response (RIR), which captures an +environment's acoustic properties, is important for speech processing and AR/VR +applications. We propose AV-RIR, a novel multi-modal multi-task learning +approach to accurately estimate the RIR from a given reverberant speech signal +and the visual cues of its corresponding environment. AV-RIR builds on a novel +neural codec-based architecture that effectively captures environment geometry +and materials properties and solves speech dereverberation as an auxiliary task +by using multi-task learning. We also propose Geo-Mat features that augment +material information into visual cues and CRIP that improves late reverberation +components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical +results show that AV-RIR quantitatively outperforms previous audio-only and +visual-only approaches by achieving 36% - 63% improvement across various +acoustic metrics in RIR estimation. Additionally, it also achieves higher +preference scores in human evaluation. 
As an auxiliary benefit, dereverbed +speech from AV-RIR shows competitive performance with the state-of-the-art in +various spoken language processing tasks and outperforms reverberation time +error score in the real-world AVSpeech dataset. Qualitative examples of both +synthesized reverberant speech and enhanced speech can be found at +https://www.youtube.com/watch?v=tTsKhviukAE.",cs.SD,"['cs.SD', 'cs.CV']" +DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization,Jisu Nam · Heesu Kim · DongJae Lee · Siyoon Jin · Seungryong Kim · Seunggyu Chang, ,https://arxiv.org/abs/2402.09812,,2402.09812.pdf,DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization,"The objective of text-to-image (T2I) personalization is to customize a +diffusion model to a user-provided reference concept, generating diverse images +of the concept aligned with the target prompts. Conventional methods +representing the reference concepts using unique text embeddings often fail to +accurately mimic the appearance of the reference. To address this, one solution +may be explicitly conditioning the reference images into the target denoising +process, known as key-value replacement. However, prior works are constrained +to local editing since they disrupt the structure path of the pre-trained T2I +model. To overcome this, we propose a novel plug-in method, called +DreamMatcher, which reformulates T2I personalization as semantic matching. +Specifically, DreamMatcher replaces the target values with reference values +aligned by semantic matching, while leaving the structure path unchanged to +preserve the versatile capability of pre-trained T2I models for generating +diverse structures. We also introduce a semantic-consistent masking strategy to +isolate the personalized concept from irrelevant regions introduced by the +target prompts. Compatible with existing T2I models, DreamMatcher shows +significant improvements in complex scenarios. Intensive analyses demonstrate +the effectiveness of our approach.",cs.CV,['cs.CV'] +"FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features",Andre Rochow · Max Schwarz · Sven Behnke,https://andrerochow.github.io/fsrt,https://arxiv.org/abs/2404.09736,,2404.09736.pdf,"FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features","The task of face reenactment is to transfer the head motion and facial +expressions from a driving video to the appearance of a source image, which may +be of a different person (cross-reenactment). Most existing methods are +CNN-based and estimate optical flow from the source image to the current +driving frame, which is then inpainted and refined to produce the output +animation. We propose a transformer-based encoder for computing a set-latent +representation of the source image(s). We then predict the output color of a +query pixel using a transformer-based decoder, which is conditioned with +keypoints and a facial expression vector extracted from the driving frame. +Latent representations of the source person are learned in a self-supervised +manner that factorize their appearance, head pose, and facial expressions. +Thus, they are perfectly suited for cross-reenactment. 
In contrast to most +related work, our method naturally extends to multiple source images and can +thus adapt to person-specific facial dynamics. We also propose data +augmentation and regularization schemes that are necessary to prevent +overfitting and support generalizability of the learned representations. We +evaluated our approach in a randomized user study. The results indicate +superior performance compared to the state-of-the-art in terms of motion +transfer quality and temporal consistency.",cs.CV,['cs.CV'] +Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models,Hongjie Wang · Difan Liu · Yan Kang · Yijun Li · Zhe Lin · Niraj Jha · Yuchen Liu,https://atedm.github.io/,https://arxiv.org/abs/2405.05252,,2405.05252.pdf,Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models,"Diffusion Models (DMs) have exhibited superior performance in generating +high-quality and diverse images. However, this exceptional performance comes at +the cost of expensive architectural design, particularly due to the attention +module heavily used in leading models. Existing works mainly adopt a retraining +process to enhance DM efficiency. This is computationally expensive and not +very scalable. To this end, we introduce the Attention-driven Training-free +Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to +perform run-time pruning of redundant tokens, without the need for any +retraining. Specifically, for single-denoising-step pruning, we develop a novel +ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify +redundant tokens, and a similarity-based recovery method to restore tokens for +the convolution operation. In addition, we propose a Denoising-Steps-Aware +Pruning (DSAP) approach to adjust the pruning budget across different denoising +timesteps for better generation quality. Extensive evaluations show that AT-EDM +performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs +saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining +nearly the same FID and CLIP scores as the full model. Project webpage: +https://atedm.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV', 'eess.SP']" +Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis,Atefeh Khoshkhahtinat · Ali Zafari · Piyush Mehta · Nasser Nasrabadi, ,https://arxiv.org/abs/2403.16258,,2403.16258.pdf,Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis,"While replacing Gaussian decoders with a conditional diffusion model enhances +the perceptual quality of reconstructions in neural image compression, their +lack of inductive bias for image data restricts their ability to achieve +state-of-the-art perceptual levels. To address this limitation, we adopt a +non-isotropic diffusion model at the decoder side. This model imposes an +inductive bias aimed at distinguishing between frequency contents, thereby +facilitating the generation of high-quality images. Moreover, our framework is +equipped with a novel entropy model that accurately models the probability +distribution of latent representation by exploiting spatio-channel correlations +in latent space, while accelerating the entropy decoding step. This +channel-wise entropy model leverages both local and global spatial contexts +within each channel chunk. The global spatial context is built upon the +Transformer, which is specifically designed for image compression tasks. 
The +designed Transformer employs a Laplacian-shaped positional encoding, the +learnable parameters of which are adaptively adjusted for each channel cluster. +Our experiments demonstrate that our proposed framework yields better +perceptual quality compared to cutting-edge generative-based codecs, and the +proposed entropy model contributes to notable bitrate savings.",eess.IV,"['eess.IV', 'cs.CV', 'cs.IT', 'cs.LG', 'math.IT']" +Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models,Pengze Zhang · Hubery Yin · Chen Li · Xiaohua Xie,https://pangzecheung.github.io/SingDiffusion/,https://arxiv.org/abs/2403.08381,,2403.08381.pdf,Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models,"Most diffusion models assume that the reverse process adheres to a Gaussian +distribution. However, this approximation has not been rigorously validated, +especially at singularities, where t=0 and t=1. Improperly dealing with such +singularities leads to an average brightness issue in applications, and limits +the generation of images with extreme brightness or darkness. We primarily +focus on tackling singularities from both theoretical and practical +perspectives. Initially, we establish the error bounds for the reverse process +approximation, and showcase its Gaussian characteristics at singularity time +steps. Based on this theoretical insight, we confirm the singularity at t=1 is +conditionally removable while it at t=0 is an inherent property. Upon these +significant conclusions, we propose a novel plug-and-play method SingDiffusion +to address the initial singular time step sampling, which not only effectively +resolves the average brightness issue for a wide range of diffusion models +without extra training efforts, but also enhances their generation capability +in achieving notable lower FID scores.",cs.CV,['cs.CV'] +"Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning",Nikhil Singh · Chih-Wei Wu · Iroro Orife · Kalayeh, ,https://arxiv.org/abs/2404.17753,,,Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,"CLIP showcases exceptional cross-modal matching capabilities due to its +training on image-text contrastive learning tasks. However, without specific +optimization for unimodal scenarios, its performance in single-modality feature +extraction might be suboptimal. Despite this, some studies have directly used +CLIP's image encoder for tasks like few-shot classification, introducing a +misalignment between its pre-training objectives and feature extraction +methods. This inconsistency can diminish the quality of the image's feature +representation, adversely affecting CLIP's effectiveness in target tasks. In +this paper, we view text features as precise neighbors of image features in +CLIP's space and present a novel CrOss-moDal nEighbor Representation(CODER) +based on the distance structure between images and their neighbor texts. This +feature extraction method aligns better with CLIP's pre-training objectives, +thereby fully leveraging CLIP's robust cross-modal capabilities. The key to +construct a high-quality CODER lies in how to create a vast amount of +high-quality and diverse texts to match with images. We introduce the Auto Text +Generator(ATG) to automatically generate the required texts in a data-free and +training-free manner. We apply CODER to CLIP's zero-shot and few-shot image +classification tasks. 
Experiment results across various datasets and models +confirm CODER's effectiveness. Code is available +at:https://github.com/YCaigogogo/CVPR24-CODER.",cs.CV,"['cs.CV', 'cs.AI']" +MaGGIe: Masked Guided Gradual Human Instance Matting,Chuong Huynh · Seoung Wug Oh · Abhinav Shrivastava · Joon-Young Lee,https://maggie-matt.github.io,https://arxiv.org/abs/2404.16035v1,,2404.16035v1.pdf,MaGGIe: Masked Guided Gradual Human Instance Matting,"Human matting is a foundation task in image and video processing, where human +foreground pixels are extracted from the input. Prior works either improve the +accuracy by additional guidance or improve the temporal consistency of a single +instance across frames. We propose a new framework MaGGIe, Masked Guided +Gradual Human Instance Matting, which predicts alpha mattes progressively for +each human instances while maintaining the computational cost, precision, and +consistency. Our method leverages modern architectures, including transformer +attention and sparse convolution, to output all instance mattes simultaneously +without exploding memory and latency. Although keeping constant inference costs +in the multiple-instance scenario, our framework achieves robust and versatile +performance on our proposed synthesized benchmarks. With the higher quality +image and video matting benchmarks, the novel multi-instance synthesis approach +from publicly available sources is introduced to increase the generalization of +models in real-world scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Aligning and Prompting Everything All at Once for Universal Visual Perception,Yunhang Shen · Chaoyou Fu · Peixian Chen · Mengdan Zhang · Ke Li · Xing Sun · Yunsheng Wu · Shaohui Lin · Rongrong Ji, ,https://arxiv.org/abs/2312.02153v1,,2312.02153v1.pdf,Aligning and Prompting Everything All at Once for Universal Visual Perception,"Vision foundation models have been explored recently to build general-purpose +vision systems. However, predominant paradigms, driven by casting +instance-level tasks as an object-word alignment, bring heavy cross-modality +interaction, which is not effective in prompting object detection and visual +grounding. Another line of work that focuses on pixel-level tasks often +encounters a large annotation gap of things and stuff, and suffers from mutual +interference between foreground-object and background-class segmentation. In +stark contrast to the prevailing methods, we present APE, a universal visual +perception model for aligning and prompting everything all at once in an image +to perform diverse tasks, i.e., detection, segmentation, and grounding, as an +instance-level sentence-object matching paradigm. Specifically, APE advances +the convergence of detection and grounding by reformulating language-guided +grounding as open-vocabulary detection, which efficiently scales up model +prompting to thousands of category vocabularies and region descriptions while +maintaining the effectiveness of cross-modality fusion. To bridge the +granularity gap of different pixel-level tasks, APE equalizes semantic and +panoptic segmentation to proxy instance learning by considering any isolated +regions as individual instances. APE aligns vision and language representation +on broad data with natural and challenging characteristics all at once without +task-specific fine-tuning. 
The extensive experiments on over 160 datasets +demonstrate that, with only one-suit of weights, APE outperforms (or is on par +with) the state-of-the-art models, proving that an effective yet universal +perception for anything aligning and prompting is indeed feasible. Codes and +trained models are released at https://github.com/shenyunhang/APE.",cs.CV,['cs.CV'] +A General and Efficient Training for Transformer via Token Expansion,Wenxuan Huang · Yunhang Shen · Jiao Xie · Baochang Zhang · Gaoqi He · Ke Li · Xing Sun · Shaohui Lin, ,https://arxiv.org/abs/2404.00672v1,,2404.00672v1.pdf,A General and Efficient Training for Transformer via Token Expansion,"The remarkable performance of Vision Transformers (ViTs) typically requires +an extremely large training cost. Existing methods have attempted to accelerate +the training of ViTs, yet typically disregard method universality with accuracy +dropping. Meanwhile, they break the training consistency of the original +transformers, including the consistency of hyper-parameters, architecture, and +strategy, which prevents them from being widely applied to different +Transformer networks. In this paper, we propose a novel token growth scheme +Token Expansion (termed ToE) to achieve consistent training acceleration for +ViTs. We introduce an ""initialization-expansion-merging"" pipeline to maintain +the integrity of the intermediate feature distribution of original +transformers, preventing the loss of crucial learnable information in the +training process. ToE can not only be seamlessly integrated into the training +and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but also +effective for efficient training frameworks (e.g., EfficientTrain), without +twisting the original training hyper-parameters, architecture, and introducing +additional training strategies. Extensive experiments demonstrate that ToE +achieves about 1.3x faster for the training of ViTs in a lossless manner, or +even with performance gains over the full-token training baselines. Code is +available at https://github.com/Osilly/TokenExpansion .",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.CV']" +RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses,bedrettin cetinkaya · Sinan Kalkan · Emre Akbas,https://ranked-cvpr24.github.io/,https://arxiv.org/abs/2403.01795,,2403.01795.pdf,RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses,"Detecting edges in images suffers from the problems of (P1) heavy imbalance +between positive and negative classes as well as (P2) label uncertainty owing +to disagreement between different annotators. Existing solutions address P1 +using class-balanced cross-entropy loss and dice loss and P2 by only predicting +edges agreed upon by most annotators. In this paper, we propose RankED, a +unified ranking-based approach that addresses both the imbalance problem (P1) +and the uncertainty problem (P2). RankED tackles these two problems with two +components: One component which ranks positive pixels over negative pixels, and +the second which promotes high confidence edge pixels to have more label +certainty. We show that RankED outperforms previous studies and sets a new +state-of-the-art on NYUD-v2, BSDS500 and Multi-cue datasets. 
Code is available +at https://ranked-cvpr24.github.io.",cs.CV,['cs.CV'] +Solving the Catastrophic Forgetting Problem in Generalized Category Discovery,Xinzi Cao · Xiawu Zheng · Guanhong Wang · Weijiang Yu · Yunhang Shen · Ke Li · Yutong Lu · Yonghong Tian, ,https://arxiv.org/abs/2308.12112,,2308.12112.pdf,Generalized Continual Category Discovery,"Most of Continual Learning (CL) methods push the limit of supervised learning +settings, where an agent is expected to learn new labeled tasks and not forget +previous knowledge. However, these settings are not well aligned with real-life +scenarios, where a learning agent has access to a vast amount of unlabeled data +encompassing both novel (entirely unlabeled) classes and examples from known +classes. Drawing inspiration from Generalized Category Discovery (GCD), we +introduce a novel framework that relaxes this assumption. Precisely, in any +task, we allow for the existence of novel and known classes, and one must use +continual version of unsupervised learning methods to discover them. We call +this setting Generalized Continual Category Discovery (GCCD). It unifies CL and +GCD, bridging the gap between synthetic benchmarks and real-life scenarios. +With a series of experiments, we present that existing methods fail to +accumulate knowledge from subsequent tasks in which unlabeled samples of novel +classes are present. In light of these limitations, we propose a method that +incorporates both supervised and unsupervised signals and mitigates the +forgetting through the use of centroid adaptation. Our method surpasses strong +CL methods adopted for GCD techniques and presents a superior representation +learning performance.",cs.LG,"['cs.LG', 'cs.CV']" +Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning,Dipam Goswami · Albin Soutif · Yuyang Liu · Sandesh Kamath · Bartłomiej Twardowski · Joost van de Weijer,https://github.com/dipamgoswami/ADC,https://arxiv.org/abs/2405.19074,,2405.19074.pdf,Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning,"Continual learning methods are known to suffer from catastrophic forgetting, +a phenomenon that is particularly hard to counter for methods that do not store +exemplars of previous tasks. Therefore, to reduce potential drift in the +feature extractor, existing exemplar-free methods are typically evaluated in +settings where the first task is significantly larger than subsequent tasks. +Their performance drops drastically in more challenging settings starting with +a smaller first task. To address this problem of feature drift estimation for +exemplar-free methods, we propose to adversarially perturb the current samples +such that their embeddings are close to the old class prototypes in the old +model embedding space. We then estimate the drift in the embedding space from +the old to the new model using the perturbed images and compensate the +prototypes accordingly. We exploit the fact that adversarial samples are +transferable from the old to the new feature space in a continual learning +setting. The generation of these images is simple and computationally cheap. We +demonstrate in our experiments that the proposed approach better tracks the +movement of prototypes in embedding space and outperforms existing methods on +several standard continual learning benchmarks as well as on fine-grained +datasets. 
Code is available at https://github.com/dipamgoswami/ADC.",cs.CV,"['cs.CV', 'cs.AI']" +Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?,Zhengyue Zhao · Jinhao Duan · Kaidi Xu · Chenan Wang · Rui Zhang · Zidong Du · Qi Guo · Xing Hu, ,https://arxiv.org/abs/2312.00084,,2312.00084.pdf,Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?,"Stable Diffusion has established itself as a foundation model in generative +AI artistic applications, receiving widespread research and application. Some +recent fine-tuning methods have made it feasible for individuals to implant +personalized concepts onto the basic Stable Diffusion model with minimal +computational costs on small datasets. However, these innovations have also +given rise to issues like facial privacy forgery and artistic copyright +infringement. In recent studies, researchers have explored the addition of +imperceptible adversarial perturbations to images to prevent potential +unauthorized exploitation and infringements when personal data is used for +fine-tuning Stable Diffusion. Although these studies have demonstrated the +ability to protect images, it is essential to consider that these methods may +not be entirely applicable in real-world scenarios. In this paper, we +systematically evaluate the use of perturbations to protect images within a +practical threat model. The results suggest that these approaches may not be +sufficient to safeguard image privacy and copyright effectively. Furthermore, +we introduce a purification method capable of removing protected perturbations +while preserving the original image structure to the greatest extent possible. +Experiments reveal that Stable Diffusion can effectively learn from purified +images over all protective methods.",cs.CV,['cs.CV'] +MoST: Multi-modality Scene Tokenization for Motion Prediction,Norman Mu · Jingwei Ji · Zhenpei Yang · Nathan Harada · Haotian Tang · Kan Chen · Charles R. Qi · Runzhou Ge · Kratarth Goel · Zoey Yang · Scott Ettinger · Rami Al-Rfou · Dragomir Anguelov · Yin Zhou, ,http://export.arxiv.org/abs/2404.19531,,2404.19531.pdf,MoST: Multi-modality Scene Tokenization for Motion Prediction,"Many existing motion prediction approaches rely on symbolic perception +outputs to generate agent trajectories, such as bounding boxes, road graph +information and traffic lights. This symbolic representation is a high-level +abstraction of the real world, which may render the motion prediction model +vulnerable to perception errors (e.g., failures in detecting open-vocabulary +obstacles) while missing salient information from the scene context (e.g., poor +road conditions). An alternative paradigm is end-to-end learning from raw +sensors. However, this approach suffers from the lack of interpretability and +requires significantly more training resources. In this work, we propose +tokenizing the visual world into a compact set of scene elements and then +leveraging pre-trained image foundation models and LiDAR neural networks to +encode all the scene elements in an open-vocabulary manner. The image +foundation model enables our scene tokens to encode the general knowledge of +the open world while the LiDAR neural network encodes geometry information. Our +proposed representation can efficiently encode the multi-frame multi-modality +observations with a few hundred tokens and is compatible with most +transformer-based architectures. 
To evaluate our method, we have augmented +Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open +Motion Dataset show that our approach leads to significant performance +improvements over the state-of-the-art.",cs.CV,['cs.CV'] +Task-Driven Wavelets using Constrained Empirical Risk Minimization,Eric Marcus · Ray Sheombarsing · Jan-Jakob Sonke · Jonas Teuwen,https://github.com/NKI-AI/CERM,,https://aiforoncology.nl/news/2024-02-27/two-papers-accepted-at-cvpr-2024/,,,,,nan +Insights from the Use of Previously Unseen Neural Architecture Search Datasets,Rob Geada · David Towers · Matthew Forshaw · Amir Atapour-Abarghouei · Stephen McGough,https://github.com/Towers-D/NAS-Unseen-Datasets,https://arxiv.org/abs/2404.02189,,2404.02189.pdf,Insights from the Use of Previously Unseen Neural Architecture Search Datasets,"The boundless possibility of neural networks which can be used to solve a +problem -- each with different performance -- leads to a situation where a Deep +Learning expert is required to identify the best neural network. This goes +against the hope of removing the need for experts. Neural Architecture Search +(NAS) offers a solution to this by automatically identifying the best +architecture. However, to date, NAS work has focused on a small set of datasets +which we argue are not representative of real-world problems. We introduce +eight new datasets created for a series of NAS Challenges: AddNIST, Language, +MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These +datasets and challenges are developed to direct attention to issues in NAS +development and to encourage authors to consider how their models will perform +on datasets unknown to them at development time. We present experimentation +using standard Deep Learning methods as well as the best results from challenge +participants.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach,Wei Dong · Xing Zhang · Bihui Chen · Dawei Yan · Zhijun Lin · Qingsen Yan · Peng Wang · Yang Yang, ,https://arxiv.org/abs/2403.19067,,2403.19067.pdf,Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach,"Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to +adeptly tailor a model to downstream tasks by learning a minimal set of new +adaptation parameters while preserving the frozen majority of pre-trained +parameters. Striking a balance between retaining the generalizable +representation capacity of the pre-trained model and acquiring task-specific +features poses a key challenge. Currently, there is a lack of focus on guiding +this delicate trade-off. In this study, we approach the problem from the +perspective of Singular Value Decomposition (SVD) of pre-trained parameter +matrices, providing insights into the tuning dynamics of existing methods. +Building upon this understanding, we propose a Residual-based Low-Rank +Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances +flexibility in parameter tuning but also ensures that new parameters do not +deviate excessively from the pre-trained model through a residual design. +Extensive experiments demonstrate that our method achieves competitive +performance across various downstream image classification tasks, all while +maintaining comparable new parameters. 
We believe this work takes a step +forward in offering a unified perspective for interpreting existing methods and +serves as motivation for the development of new approaches that move closer to +effectively considering the crucial trade-off mentioned above. Our code is +available at +\href{https://github.com/zstarN70/RLRR.git}{https://github.com/zstarN70/RLRR.git}.",cs.CV,['cs.CV'] +Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms,Joren Brunekreef · Eric Marcus · Ray Sheombarsing · Jan-Jakob Sonke · Jonas Teuwen,https://github.com/NKI-AI/kandinsky-calibration,https://arxiv.org/abs/2311.11837v1,,2311.11837v1.pdf,Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms,"Image segmentation algorithms can be understood as a collection of pixel +classifiers, for which the outcomes of nearby pixels are correlated. Classifier +models can be calibrated using Inductive Conformal Prediction, but this +requires holding back a sufficiently large calibration dataset for computing +the distribution of non-conformity scores of the model's predictions. If one +only requires only marginal calibration on the image level, this calibration +set consists of all individual pixels in the images available for calibration. +However, if the goal is to attain proper calibration for each individual pixel +classifier, the calibration set consists of individual images. In a scenario +where data are scarce (such as the medical domain), it may not always be +possible to set aside sufficiently many images for this pixel-level +calibration. The method we propose, dubbed ``Kandinsky calibration'', makes use +of the spatial structure present in the distribution of natural images to +simultaneously calibrate the classifiers of ``similar'' pixels. This can be +seen as an intermediate approach between marginal (imagewise) and conditional +(pixelwise) calibration, where non-conformity scores are aggregated over +similar image regions, thereby making more efficient use of the images +available for calibration. We run experiments on segmentation algorithms +trained and calibrated on subsets of the public MS-COCO and Medical Decathlon +datasets, demonstrating that Kandinsky calibration method can significantly +improve the coverage. When compared to both pixelwise and imagewise calibration +on little data, the Kandinsky method achieves much lower coverage errors, +indicating the data efficiency of the Kandinsky calibration.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion,Xinyu Zhan · Lixin Yang · Yifei Zhao · Kangrui Mao · Hanlin Xu · Zenan Lin · Kailin Li · Cewu Lu, ,,https://paperswithcode.com/paper/oakink2-a-dataset-of-bimanual-hands-object,,,,,nan +What Moves Together Belongs Together,Jenny Seidenschwarz · Aljoša Ošep · Francesco Ferroni · Simon Lucey · Laura Leal-Taixe,https://research.nvidia.com/labs/dvl/projects/semoli/,https://arxiv.org/abs/2402.19463,,2402.19463.pdf,SeMoLi: What Moves Together Belongs Together,"We tackle semi-supervised object detection based on motion cues. Recent +results suggest that heuristic-based clustering methods in conjunction with +object trackers can be used to pseudo-label instances of moving objects and use +these as supervisory signals to train 3D object detectors in Lidar data without +manual supervision. 
We re-think this approach and suggest that both, object +detection, as well as motion-inspired pseudo-labeling, can be tackled in a +data-driven manner. We leverage recent advances in scene flow estimation to +obtain point trajectories from which we extract long-term, class-agnostic +motion patterns. Revisiting correlation clustering in the context of message +passing networks, we learn to group those motion patterns to cluster points to +object instances. By estimating the full extent of the objects, we obtain +per-scan 3D bounding boxes that we use to supervise a Lidar object detection +network. Our method not only outperforms prior heuristic-based approaches (57.5 +AP, +14 improvement over prior work), more importantly, we show we can +pseudo-label and train object detectors across datasets.",cs.CV,['cs.CV'] +Stratified Avatar Generation from Sparse Observations,Han Feng · Wenchao Ma · Quankai Gao · Xianwei Zheng · Nan Xue · Huijuan Xu, ,https://arxiv.org/abs/2405.20786,,2405.20786.pdf,Stratified Avatar Generation from Sparse Observations,"Estimating 3D full-body avatars from AR/VR devices is essential for creating +immersive experiences in AR/VR applications. This task is challenging due to +the limited input from Head Mounted Devices, which capture only sparse +observations from the head and hands. Predicting the full-body avatars, +particularly the lower body, from these sparse observations presents +significant difficulties. In this paper, we are inspired by the inherent +property of the kinematic tree defined in the Skinned Multi-Person Linear +(SMPL) model, where the upper body and lower body share only one common +ancestor node, bringing the potential of decoupled reconstruction. We propose a +stratified approach to decouple the conventional full-body avatar +reconstruction pipeline into two stages, with the reconstruction of the upper +body first and a subsequent reconstruction of the lower body conditioned on the +previous stage. To implement this straightforward idea, we leverage the latent +diffusion model as a powerful probabilistic generator, and train it to follow +the latent distribution of decoupled motions explored by a VQ-VAE +encoder-decoder model. Extensive experiments on AMASS mocap dataset demonstrate +our state-of-the-art performance in the reconstruction of full-body motions.",cs.CV,"['cs.CV', 'cs.HC']" +VLP: Vision Language Planning for Autonomous Driving,Chenbin Pan · Burhaneddin Yaman · Tommaso Nesti · Abhirup Mallik · Alessandro G Allievi · Senem Velipasalar · Liu Ren, ,https://arxiv.org/abs/2401.05577,,2401.05577.pdf,VLP: Vision Language Planning for Autonomous Driving,"Autonomous driving is a complex and challenging task that aims at safe motion +planning through scene understanding and reasoning. While vision-only +autonomous driving methods have recently achieved notable performance, through +enhanced scene understanding, several key issues, including lack of reasoning, +low generalization performance and long-tail scenarios, still need to be +addressed. In this paper, we present VLP, a novel Vision-Language-Planning +framework that exploits language models to bridge the gap between linguistic +understanding and autonomous driving. VLP enhances autonomous driving systems +by strengthening both the source memory foundation and the self-driving car's +contextual understanding. 
VLP achieves state-of-the-art end-to-end planning +performance on the challenging NuScenes dataset by achieving 35.9\% and 60.5\% +reduction in terms of average L2 error and collision rates, respectively, +compared to the previous best method. Moreover, VLP shows improved performance +in challenging long-tail scenarios and strong generalization capabilities when +faced with new urban environments.",cs.CV,['cs.CV'] +Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding,Sai Wang · Yutian Lin · Yu Wu, ,https://ar5iv.labs.arxiv.org/html/2312.09625,,2312.09625.pdf,Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment,"Learning to ground natural language queries to target objects or regions in +3D point clouds is quite essential for 3D scene understanding. Nevertheless, +existing 3D visual grounding approaches require a substantial number of +bounding box annotations for text queries, which is time-consuming and +labor-intensive to obtain. In this paper, we propose \textbf{3D-VLA}, a weakly +supervised approach for \textbf{3D} visual grounding based on \textbf{V}isual +\textbf{L}inguistic \textbf{A}lignment. Our 3D-VLA exploits the superior +ability of current large-scale vision-language models (VLMs) on aligning the +semantics between texts and 2D images, as well as the naturally existing +correspondences between 2D images and 3D point clouds, and thus implicitly +constructs correspondences between texts and 3D point clouds with no need for +fine-grained box annotations in the training procedure. During the inference +stage, the learned text-3D correspondence will help us ground the text queries +to the 3D target objects even without 2D images. To the best of our knowledge, +this is the first work to investigate 3D visual grounding in a weakly +supervised manner by involving large scale vision-language models, and +extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our +3D-VLA achieves comparable and even superior results over the fully supervised +methods.",cs.CV,"['cs.CV', 'cs.CL']" +3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions,Weijia Li · Haote Yang · Zhenghao Hu · Juepeng Zheng · Gui-Song Xia · Conghui He, ,https://arxiv.org/abs/2404.04823,,2404.04823.pdf,3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions,"3D building reconstruction from monocular remote sensing images is an +important and challenging research problem that has received increasing +attention in recent years, owing to its low cost of data acquisition and +availability for large-scale applications. However, existing methods rely on +expensive 3D-annotated samples for fully-supervised training, restricting their +application to large-scale cross-city scenarios. In this work, we propose +MLS-BRN, a multi-level supervised building reconstruction network that can +flexibly utilize training samples with different annotation levels to achieve +better reconstruction results in an end-to-end manner. To alleviate the demand +on full 3D supervision, we design two new modules, Pseudo Building Bbox +Calculator and Roof-Offset guided Footprint Extractor, as well as new tasks and +training strategies for different types of samples. 
Experimental results on +several public and new datasets demonstrate that our proposed MLS-BRN achieves +competitive performance using much fewer 3D-annotated samples, and +significantly improves the footprint extraction and 3D reconstruction +performance compared with current state-of-the-art. The code and datasets of +this work will be released at https://github.com/opendatalab/MLS-BRN.git.",cs.CV,['cs.CV'] +SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects,Abhinav Kumar · Yuliang Guo · Xinyu Huang · Liu Ren · Xiaoming Liu,https://github.com/abhi1kumar/SeaBird,https://arxiv.org/abs/2403.20318,,2403.20318.pdf,SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects,"Monocular 3D detectors achieve remarkable performance on cars and smaller +objects. However, their performance drops on larger objects, leading to fatal +accidents. Some attribute the failures to training data scarcity or their +receptive field requirements of large objects. In this paper, we highlight this +understudied problem of generalization to large objects. We find that modern +frontal detectors struggle to generalize to large objects even on nearly +balanced datasets. We argue that the cause of failure is the sensitivity of +depth regression losses to noise of larger objects. To bridge this gap, we +comprehensively investigate regression and dice losses, examining their +robustness under varying error levels and object sizes. We mathematically prove +that the dice loss leads to superior noise-robustness and model convergence for +large objects compared to regression losses for a simplified case. Leveraging +our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as +the first step towards generalizing to large objects. SeaBird effectively +integrates BEV segmentation on foreground objects for 3D detection, with the +segmentation head trained with the dice loss. SeaBird achieves SoTA results on +the KITTI-360 leaderboard and improves existing detectors on the nuScenes +leaderboard, particularly for large objects. Code and models at +https://github.com/abhi1kumar/SeaBird",cs.CV,"['cs.CV', 'cs.AI']" +Learning to Count without Annotations,Lukas Knobel · Tengda Han · Yuki Asano,https://github.com/lukasknobel/SelfCollages,https://web3.arxiv.org/abs/2307.08727,,2307.08727.pdf,Learning to Count without Annotations,"While recent supervised methods for reference-based object counting continue +to improve the performance on benchmark datasets, they have to rely on small +datasets due to the cost associated with manually annotating dozens of objects +in images. We propose UnCounTR, a model that can learn this task without +requiring any manual annotations. To this end, we construct ""Self-Collages"", +images with various pasted objects as training samples, that provide a rich +learning signal covering arbitrary object types and counts. Our method builds +on existing unsupervised representations and segmentation techniques to +successfully demonstrate for the first time the ability of reference-based +counting without manual supervision. 
Our experiments show that our method not +only outperforms simple baselines and generic models such as FasterRCNN and +DETR, but also matches the performance of supervised counting models in some +domains.",cs.CV,['cs.CV'] +AM-RADIO: Agglomerative Models - Reduce All Domains Into One,Mike Ranzinger · Greg Heinrich · Jan Kautz · Pavlo Molchanov,https://github.com/NVlabs/RADIO,https://arxiv.org/abs/2312.06709,,2312.06709.pdf,AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One,"A handful of visual foundation models (VFMs) have recently emerged as the +backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are +trained with distinct objectives, exhibiting unique characteristics for various +downstream tasks. We find that despite their conceptual differences, these +models can be effectively merged into a unified model through multi-teacher +distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All +Domains Into One). This integrative approach not only surpasses the performance +of individual teacher models but also amalgamates their distinctive features, +such as zero-shot vision-language comprehension, detailed pixel-level +understanding, and open vocabulary segmentation capabilities. In pursuit of the +most hardware-efficient backbone, we evaluated numerous architectures in our +multi-teacher distillation pipeline using the same training recipe. This led to +the development of a novel architecture (E-RADIO) that exceeds the performance +of its predecessors and is at least 7x faster than the teacher models. Our +comprehensive benchmarking process covers downstream tasks including ImageNet +classification, ADE20k semantic segmentation, COCO object detection and +LLaVa-1.5 framework. + Code: https://github.com/NVlabs/RADIO",cs.CV,['cs.CV'] +Activity-Biometrics: Person Identification from Daily Activities,Shehreen Azad · Yogesh S. Rawat, ,https://arxiv.org/abs/2403.17360,,2403.17360.pdf,Activity-Biometrics: Person Identification from Daily Activities,"In this work, we study a novel problem which focuses on person identification +while performing daily activities. Learning biometric features from RGB videos +is challenging due to spatio-temporal complexity and presence of appearance +biases such as clothing color and background. We propose ABNet, a novel +framework which leverages disentanglement of biometric and non-biometric +features to perform effective person identification from daily activities. +ABNet relies on a bias-less teacher to learn biometric features from RGB videos +and explicitly disentangle non-biometric features with the help of biometric +distortion. In addition, ABNet also exploits activity prior for biometrics +which is enabled by joint biometric and activity learning. We perform +comprehensive evaluation of the proposed approach across five different +datasets which are derived from existing activity recognition benchmarks. +Furthermore, we extensively compare ABNet with existing works in person +identification and demonstrate its effectiveness for activity-based biometrics +across all five datasets. 
The code and dataset can be accessed at: +\url{https://github.com/sacrcv/Activity-Biometrics/}",cs.CV,['cs.CV'] +Unsupervised Template-assisted Point Cloud Shape Correspondence Network,Jiacheng Deng · Jiahao Lu · Tianzhu Zhang, ,https://arxiv.org/abs/2403.16412,,2403.16412.pdf,Unsupervised Template-assisted Point Cloud Shape Correspondence Network,"Unsupervised point cloud shape correspondence aims to establish point-wise +correspondences between source and target point clouds. Existing methods obtain +correspondences directly by computing point-wise feature similarity between +point clouds. However, non-rigid objects possess strong deformability and +unusual shapes, making it a longstanding challenge to directly establish +correspondences between point clouds with unconventional shapes. To address +this challenge, we propose an unsupervised Template-Assisted point cloud shape +correspondence Network, termed TANet, including a template generation module +and a template assistance module. The proposed TANet enjoys several merits. +Firstly, the template generation module establishes a set of learnable +templates with explicit structures. Secondly, we introduce a template +assistance module that extensively leverages the generated templates to +establish more accurate shape correspondences from multiple perspectives. +Extensive experiments on four human and animal datasets demonstrate that TANet +achieves favorable performance against state-of-the-art methods.",cs.CV,['cs.CV'] +Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination,Yixin Zeng · Zoubin Bi · Yin Mingrui · Xiang Feng · Kun Zhou · Hongzhi Wu, ,https://arxiv.org/html/2404.10766v1,,2404.10766v1.pdf,RapidVol: Rapid Reconstruction of 3D Ultrasound Volumes from Sensorless 2D Scans,"Two-dimensional (2D) freehand ultrasonography is one of the most commonly +used medical imaging modalities, particularly in obstetrics and gynaecology. +However, it only captures 2D cross-sectional views of inherently 3D anatomies, +losing valuable contextual information. As an alternative to requiring costly +and complex 3D ultrasound scanners, 3D volumes can be constructed from 2D scans +using machine learning. However this usually requires long computational time. +Here, we propose RapidVol: a neural representation framework to speed up +slice-to-volume ultrasound reconstruction. We use tensor-rank decomposition, to +decompose the typical 3D volume into sets of tri-planes, and store those +instead, as well as a small neural network. A set of 2D ultrasound scans, with +their ground truth (or estimated) 3D position and orientation (pose) is all +that is required to form a complete 3D reconstruction. Reconstructions are +formed from real fetal brain scans, and then evaluated by requesting novel +cross-sectional views. When compared to prior approaches based on fully +implicit representation (e.g. neural radiance fields), our method is over 3x +quicker, 46% more accurate, and if given inaccurate poses is more robust. 
+Further speed-up is also possible by reconstructing from a structural prior +rather than from scratch.",eess.IV,"['eess.IV', 'cs.CV']" +FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences,Haobo Xu · Jun Zhou · Hua Yang · Renjie Pan · Cunyan Li, ,,https://www.researchgate.net/publication/376796777_Matching-to-Detecting_Establishing_Dense_and_Reliable_Correspondences_Between_Images,,,,,nan +CFAT: Unleashing Triangular Windows for Image Super-resolution,Abhisek Ray · Gaurav Kumar · Maheshkumar Kolekar, ,https://arxiv.org/abs/2403.16143,,2403.16143.pdf,CFAT: Unleashing TriangularWindows for Image Super-resolution,"Transformer-based models have revolutionized the field of image +super-resolution (SR) by harnessing their inherent ability to capture complex +contextual features. The overlapping rectangular shifted window technique used +in transformer architecture nowadays is a common practice in super-resolution +models to improve the quality and robustness of image upscaling. However, it +suffers from distortion at the boundaries and has limited unique shifting +modes. To overcome these weaknesses, we propose a non-overlapping triangular +window technique that synchronously works with the rectangular one to mitigate +boundary-level distortion and allows the model to access more unique sifting +modes. In this paper, we propose a Composite Fusion Attention Transformer +(CFAT) that incorporates triangular-rectangular window-based local attention +with a channel-based global attention technique in image super-resolution. As a +result, CFAT enables attention mechanisms to be activated on more image pixels +and captures long-range, multi-scale features to improve SR performance. The +extensive experimental results and ablation study demonstrate the effectiveness +of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB +performance improvement over other state-of-the-art SR architectures.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG', 'cs.MM']" +Deciphering ‘What’ and ‘Where’ Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations,Xiao Zhang · David Yunis · Michael Maire, ,https://arxiv.org/abs/2312.06716,,2312.06716.pdf,Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations,"We present an approach for analyzing grouping information contained within a +neural network's activations, permitting extraction of spatial layout and +semantic segmentation from the behavior of large pre-trained vision models. +Unlike prior work, our method conducts a wholistic analysis of a network's +activation state, leveraging features from all layers and obviating the need to +guess which part of the model contains relevant information. Motivated by +classic spectral clustering, we formulate this analysis in terms of an +optimization objective involving a set of affinity matrices, each formed by +comparing features within a different layer. Solving this optimization problem +using gradient descent allows our technique to scale from single images to +dataset-level analysis, including, in the latter, both intra- and inter-image +relationships. Analyzing a pre-trained generative transformer provides insight +into the computational strategy learned by such models. Equating affinity with +key-query similarity across attention layers yields eigenvectors encoding scene +spatial layout, whereas defining affinity by value vector similarity yields +eigenvectors encoding object identity. 
This result suggests that key and query +vectors coordinate attentional information flow according to spatial proximity +(a `where' pathway), while value vectors refine a semantic category +representation (a `what' pathway).",cs.CV,['cs.CV'] +DART: Implicit Doppler Tomography for Radar Novel View Synthesis,Tianshu Huang · John Miller · Akarsh Prabhakara · Tao Jin · Tarana Laroia · Zico Kolter · Anthony Rowe,https://wiselabcmu.github.io/dart/,https://arxiv.org/abs/2403.03896v1,,2403.03896v1.pdf,DART: Implicit Doppler Tomography for Radar Novel View Synthesis,"Simulation is an invaluable tool for radio-frequency system designers that +enables rapid prototyping of various algorithms for imaging, target detection, +classification, and tracking. However, simulating realistic radar scans is a +challenging task that requires an accurate model of the scene, radio frequency +material properties, and a corresponding radar synthesis function. Rather than +specifying these models explicitly, we propose DART - Doppler Aided Radar +Tomography, a Neural Radiance Field-inspired method which uses radar-specific +physics to create a reflectance and transmittance-based rendering pipeline for +range-Doppler images. We then evaluate DART by constructing a custom data +collection platform and collecting a novel radar dataset together with accurate +position and instantaneous velocity measurements from lidar-based localization. +In comparison to state-of-the-art baselines, DART synthesizes superior radar +range-Doppler images from novel views across all datasets and additionally can +be used to generate high quality tomographic images.",cs.CV,"['cs.CV', 'cs.LG']" +Don’t drop your samples! Coherence-aware training benefits Conditional diffusion,Nicolas Dufour · Victor Besnier · Vicky Kalogeiton · David Picard,https://nicolas-dufour.github.io/cad,https://arxiv.org/abs/2405.20324,,2405.20324.pdf,Don't drop your samples! Coherence-aware training benefits Conditional diffusion,"Conditional diffusion models are powerful generative models that can leverage +various types of conditional information, such as class labels, segmentation +masks, or text captions. However, in many real-world scenarios, conditional +information may be noisy or unreliable due to human annotation errors or weak +alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a +novel method that integrates coherence in conditional information into +diffusion models, allowing them to learn from noisy annotations without +discarding data. We assume that each data point has an associated coherence +score that reflects the quality of the conditional information. We then +condition the diffusion model on both the conditional information and the +coherence score. In this way, the model learns to ignore or discount the +conditioning when the coherence is low. We show that CAD is theoretically sound +and empirically effective on various conditional generation tasks. 
Moreover, we +show that leveraging coherence generates realistic and diverse samples that +respect conditional information better than models trained on cleaned datasets +where samples with low coherence have been discarded.",cs.CV,"['cs.CV', 'cs.LG']" +"Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation",Biao Gong · Siteng Huang · Yutong Feng · Shiwei Zhang · Yuyuan Li · Yu Liu, ,https://arxiv.org/abs/2311.15773,,2311.15773.pdf,"Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation","Diffusion models have recently achieved remarkable progress in generating +realistic images. However, challenges remain in accurately understanding and +synthesizing the layout requirements in the textual prompts. To align the +generated image with layout instructions, we present a training-free layout +calibration system SimM that intervenes in the generative process on the fly +during inference time. Specifically, following a ""check-locate-rectify"" +pipeline, the system first analyses the prompt to generate the target layout +and compares it with the intermediate outputs to automatically detect errors. +Then, by moving the located activations and making intra- and inter-map +adjustments, the rectification process can be performed with negligible +computational overhead. To evaluate SimM over a range of layout requirements, +we present a benchmark SimMBench that compensates for the lack of superlative +spatial relations in existing datasets. And both quantitative and qualitative +results demonstrate the effectiveness of the proposed SimM in calibrating the +layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM.",cs.CV,['cs.CV'] +GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos,Tomas Soucek · Dima Damen · Michael Wray · Ivan Laptev · Josef Sivic,https://soczech.github.io/genhowto/,https://arxiv.org/abs/2312.07322,,2312.07322.pdf,GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos,"We address the task of generating temporally consistent and physically +plausible images of actions and object state transformations. Given an input +image and a text prompt describing the targeted transformation, our generated +images preserve the environment and transform objects in the initial image. Our +contributions are threefold. First, we leverage a large body of instructional +videos and automatically mine a dataset of triplets of consecutive frames +corresponding to initial object states, actions, and resulting object +transformations. Second, equipped with this data, we develop and train a +conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a +variety of objects and actions and show superior performance compared to +existing methods. In particular, we introduce a quantitative evaluation where +GenHowTo achieves 88% and 74% on seen and unseen interaction categories, +respectively, outperforming prior work by a large margin.",cs.CV,['cs.CV'] +From Coarse to Fine-Grained Open-Set Recognition,Nico Lang · Vésteinn Snæbjarnarson · Elijah Cole · Oisin Mac Aodha · Christian Igel · Serge Belongie, ,https://arxiv.org/abs/2307.07214,,2307.07214.pdf,Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition,"Open-set image recognition is a challenging topic in computer vision. 
Most of +the existing works in literature focus on learning more discriminative features +from the input images, however, they are usually insensitive to the high- or +low-frequency components in features, resulting in a decreasing performance on +fine-grained image recognition. To address this problem, we propose a +Complementary Frequency-varying Awareness Network that could better capture +both high-frequency and low-frequency information, called CFAN. The proposed +CFAN consists of three sequential modules: (i) a feature extraction module is +introduced for learning preliminary features from the input images; (ii) a +frequency-varying filtering module is designed to separate out both high- and +low-frequency components from the preliminary features in the frequency domain +via a frequency-adjustable filter; (iii) a complementary temporal aggregation +module is designed for aggregating the high- and low-frequency components via +two Long Short-Term Memory networks into discriminative features. Based on +CFAN, we further propose an open-set fine-grained image recognition method, +called CFAN-OSFGR, which learns image features via CFAN and classifies them via +a linear classifier. Experimental results on 3 fine-grained datasets and 2 +coarse-grained datasets demonstrate that CFAN-OSFGR performs significantly +better than 9 state-of-the-art methods in most cases.",cs.CV,['cs.CV'] +FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,Jinglin Xu · Yijie Guo · Yuxin Peng,https://pku-icst-mipl.github.io/FinePOSE_ProjectPage/,https://arxiv.org/abs/2405.05216,,,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. 
Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +FedMef: Towards Memory-efficient Federated Dynamic Pruning,Hong Huang · Weiming Zhuang · Chen Chen · Lingjuan Lyu, ,https://arxiv.org/abs/2403.14737,,2403.14737.pdf,FedMef: Towards Memory-efficient Federated Dynamic Pruning,"Federated learning (FL) promotes decentralized training while prioritizing +data confidentiality. However, its application on resource-constrained devices +is challenging due to the high demand for computation and memory resources to +train deep learning models. Neural network pruning techniques, such as dynamic +pruning, could enhance model efficiency, but directly adopting them in FL still +poses substantial challenges, including post-pruning performance degradation, +high activation memory usage, etc. To address these challenges, we propose +FedMef, a novel and memory-efficient federated dynamic pruning framework. +FedMef comprises two key components. First, we introduce the budget-aware +extrusion that maintains pruning efficiency while preserving post-pruning +performance by salvaging crucial information from parameters marked for pruning +within a given budget. Second, we propose scaled activation pruning to +effectively reduce activation memory footprints, which is particularly +beneficial for deploying FL to memory-limited devices. Extensive experiments +demonstrate the effectiveness of our proposed FedMef. In particular, it +achieves a significant reduction of 28.5% in memory footprint compared to +state-of-the-art methods while obtaining superior accuracy.",cs.LG,"['cs.LG', 'cs.DC']" +FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition,Sicheng Mo · Fangzhou Mu · Kuan Heng Lin · Yanli Liu · Bochen Guan · Yin Li · Bolei Zhou, ,https://arxiv.org/abs/2312.07536,,2312.07536.pdf,FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition,"Recent approaches such as ControlNet offer users fine-grained spatial control +over text-to-image (T2I) diffusion models. However, auxiliary modules have to +be trained for each type of spatial condition, model architecture, and +checkpoint, putting them at odds with the diverse intents and preferences a +human designer would like to convey to the AI models during the content +creation process. In this work, we present FreeControl, a training-free +approach for controllable T2I generation that supports multiple conditions, +architectures, and checkpoints simultaneously. FreeControl designs structure +guidance to facilitate the structure alignment with a guidance image, and +appearance guidance to enable the appearance sharing between images generated +using the same seed. Extensive qualitative and quantitative experiments +demonstrate the superior performance of FreeControl across a variety of +pre-trained T2I models. 
In particular, FreeControl facilitates convenient +training-free control over many different architectures and checkpoints, allows +the challenging input conditions on which most of the existing training-free +methods fail, and achieves competitive synthesis quality with training-based +approaches.",cs.CV,['cs.CV'] +Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space,Naveen Kumar Kummari · Reshmi Mitra · Krishna Mohan Chalavadi,https://github.com/NaveenKumar-1311/FCD,https://arxiv.org/abs/2307.08672,,2307.08672.pdf,FedDefender: Backdoor Attack Defense in Federated Learning,"Federated Learning (FL) is a privacy-preserving distributed machine learning +technique that enables individual clients (e.g., user participants, edge +devices, or organizations) to train a model on their local data in a secure +environment and then share the trained model with an aggregator to build a +global model collaboratively. In this work, we propose FedDefender, a defense +mechanism against targeted poisoning attacks in FL by leveraging differential +testing. Our proposed method fingerprints the neuron activations of clients' +models on the same input and uses differential testing to identify a +potentially malicious client containing a backdoor. We evaluate FedDefender +using MNIST and FashionMNIST datasets with 20 and 30 clients, and our results +demonstrate that FedDefender effectively mitigates such attacks, reducing the +attack success rate (ASR) to 10\% without deteriorating the global model +performance.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV', 'cs.LG']" +DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions,Yunxiao Shi · Manish Singh · Hong Cai · Fatih Porikli, ,https://arxiv.org/abs/2403.12202,,2403.12202.pdf,DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions,"In this paper, we introduce a novel approach that harnesses both 2D and 3D +attentions to enable highly accurate depth completion without requiring +iterative spatial propagations. Specifically, we first enhance a baseline +convolutional depth completion model by applying attention to 2D features in +the bottleneck and skip connections. This effectively improves the performance +of this simple network and sets it on par with the latest, complex +transformer-based models. Leveraging the initial depths and features from this +network, we uplift the 2D features to form a 3D point cloud and construct a 3D +point transformer to process it, allowing the model to explicitly learn and +exploit 3D geometric features. In addition, we propose normalization techniques +to process the point cloud, which improves learning and leads to better +accuracy than directly using point transformers off the shelf. Furthermore, we +incorporate global attention on downsampled point cloud features, which enables +long-range context while still being computationally feasible. We evaluate our +method, DeCoTR, on established depth completion benchmarks, including NYU Depth +V2 and KITTI, showcasing that it sets new state-of-the-art performance. We +further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and +demonstrate that DeCoTR has superior generalizability compared to existing +approaches.",cs.CV,['cs.CV'] +TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing,Sherry X. 
Chen · Yaron Vaxman · Elad Ben Baruch · David Asulin · Aviad Moreshet · Kuo-Chin Lien · Misha Sra · Pradeep Sen,https://github.com/SherryXTChen/TiNO-Edit,https://arxiv.org/abs/2404.11120,,2404.11120.pdf,TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing,"Despite many attempts to leverage pre-trained text-to-image models (T2I) like +Stable Diffusion (SD) for controllable image editing, producing good +predictable results remains a challenge. Previous approaches have focused on +either fine-tuning pre-trained T2I models on specific datasets to generate +certain kinds of images (e.g., with a specific object or person), or on +optimizing the weights, text prompts, and/or learning features for each input +image in an attempt to coax the image generator to produce the desired result. +However, these approaches all have shortcomings and fail to produce good +results in a predictable and controllable manner. To address this problem, we +present TiNO-Edit, an SD-based method that focuses on optimizing the noise +patterns and diffusion timesteps during editing, something previously +unexplored in the literature. With this simple change, we are able to generate +results that both better align with the original images and reflect the desired +result. Furthermore, we propose a set of new loss functions that operate in the +latent domain of SD, greatly speeding up the optimization when compared to +prior approaches, which operate in the pixel domain. Our method can be easily +applied to variations of SD including Textual Inversion and DreamBooth that +encode new concepts and incorporate them into the edited results. We present a +host of image-editing capabilities enabled by our approach. Our code is +publicly available at https://github.com/SherryXTChen/TiNO-Edit.",cs.CV,['cs.CV'] +Memory-Scalable and Simplified Functional Map Learning,Robin Magnet · Maks Ovsjanikov, ,https://arxiv.org/abs/2404.00330,,2404.00330.pdf,Memory-Scalable and Simplified Functional Map Learning,"Deep functional maps have emerged in recent years as a prominent +learning-based framework for non-rigid shape matching problems. While early +methods in this domain only focused on learning in the functional domain, the +latest techniques have demonstrated that by promoting consistency between +functional and pointwise maps leads to significant improvements in accuracy. +Unfortunately, existing approaches rely heavily on the computation of large +dense matrices arising from soft pointwise maps, which compromises their +efficiency and scalability. To address this limitation, we introduce a novel +memory-scalable and efficient functional map learning pipeline. By leveraging +the specific structure of functional maps, we offer the possibility to achieve +identical results without ever storing the pointwise map in memory. +Furthermore, based on the same approach, we present a differentiable map +refinement layer adapted from an existing axiomatic refinement algorithm. +Unlike many functional map learning methods, which use this algorithm at a +post-processing step, ours can be easily used at train time, enabling to +enforce consistency between the refined and initial versions of the map. 
Our +resulting approach is both simpler, more efficient and more numerically stable, +by avoiding differentiation through a linear system, while achieving close to +state-of-the-art results in challenging scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment,Jinglin Xu · Sibo Yin · Guohao Zhao · Zishuo Wang · Yuxin Peng, ,https://arxiv.org/abs/2405.06887,,2405.06887.pdf,FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment,"Existing action quality assessment (AQA) methods mainly learn deep +representations at the video level for scoring diverse actions. Due to the lack +of a fine-grained understanding of actions in videos, they harshly suffer from +low credibility and interpretability, thus insufficient for stringent +applications, such as Olympic diving events. We argue that a fine-grained +understanding of actions requires the model to perceive and parse actions in +both time and space, which is also the key to the credibility and +interpretability of the AQA technique. Based on this insight, we propose a new +fine-grained spatial-temporal action parser named \textbf{FineParser}. It +learns human-centric foreground action representations by focusing on target +action regions within each frame and exploiting their fine-grained alignments +in time and space to minimize the impact of invalid backgrounds during the +assessment. In addition, we construct fine-grained annotations of human-centric +foreground action masks for the FineDiving dataset, called +\textbf{FineDiving-HM}. With refined annotations on diverse target action +procedures, FineDiving-HM can promote the development of real-world AQA +systems. Through extensive experiments, we demonstrate the effectiveness of +FineParser, which outperforms state-of-the-art methods while supporting more +tasks of fine-grained action understanding. Data and code are available at +\url{https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024}.",cs.CV,['cs.CV'] +Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment,Jiyuan Zhang · Shiyan Chen · Yajing Zheng · Zhaofei Yu · Tiejun Huang, ,https://arxiv.org/abs/2403.09486,,2403.09486.pdf,SpikeReveal: Unlocking Temporal Sequences from Real Blurry Inputs with Spike Streams,"Reconstructing a sequence of sharp images from the blurry input is crucial +for enhancing our insights into the captured scene and poses a significant +challenge due to the limited temporal features embedded in the image. Spike +cameras, sampling at rates up to 40,000 Hz, have proven effective in capturing +motion features and beneficial for solving this ill-posed problem. Nonetheless, +existing methods fall into the supervised learning paradigm, which suffers from +notable performance degradation when applied to real-world scenarios that +diverge from the synthetic training data domain. Moreover, the quality of +reconstructed images is capped by the generated images based on motion analysis +interpolation, which inherently differs from the actual scene, affecting the +generalization ability of these methods in real high-speed scenarios. To +address these challenges, we propose the first self-supervised framework for +the task of spike-guided motion deblurring. Our approach begins with the +formulation of a spike-guided deblurring model that explores the theoretical +relationships among spike streams, blurry images, and their corresponding sharp +sequences. 
We subsequently develop a self-supervised cascaded framework to +alleviate the issues of spike noise and spatial-resolution mismatching +encountered in the deblurring model. With knowledge distillation and +re-blurring loss, we further design a lightweight deblur network to generate +high-quality sequences with brightness and texture consistency with the +original input. Quantitative and qualitative experiments conducted on our +real-world and synthetic datasets with spikes validate the superior +generalization of the proposed framework. Our code, data and trained models +will be available at \url{https://github.com/chenkang455/S-SDM}.",cs.CV,['cs.CV'] +Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle,Hyeokjun Kweon · Jihun Kim · Kuk-Jin Yoon, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cit2.12239,,,,,nan +From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation,Hyeokjun Kweon · Kuk-Jin Yoon, ,https://arxiv.org/abs/2312.03585,,2312.03585.pdf,Foundation Model Assisted Weakly Supervised Semantic Segmentation,"This work aims to leverage pre-trained foundation models, such as contrastive +language-image pre-training (CLIP) and segment anything model (SAM), to address +weakly supervised semantic segmentation (WSSS) using image-level labels. To +this end, we propose a coarse-to-fine framework based on CLIP and SAM for +generating high-quality segmentation seeds. Specifically, we construct an image +classification task and a seed segmentation task, which are jointly performed +by CLIP with frozen weights and two sets of learnable task-specific prompts. A +SAM-based seeding (SAMS) module is designed and applied to each task to produce +either coarse or fine seed maps. Moreover, we design a multi-label contrastive +loss supervised by image-level labels and a CAM activation loss supervised by +the generated coarse seed map. These losses are used to learn the prompts, +which are the only parts need to be learned in our framework. Once the prompts +are learned, we input each image along with the learned segmentation-specific +prompts into CLIP and the SAMS module to produce high-quality segmentation +seeds. These seeds serve as pseudo labels to train an off-the-shelf +segmentation network like other two-stage WSSS methods. Experiments show that +our method achieves the state-of-the-art performance on PASCAL VOC 2012 and +competitive results on MS COCO 2014. Code is available at +https://github.com/HAL-42/FMA-WSSS.git.",cs.CV,"['cs.CV', 'cs.AI']" +SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System,Yunfei Fan · Tianyu Zhao · Guidong Wang,https://github.com/bytedance/SchurVINS,https://arxiv.org/abs/2312.01616,,2312.01616.pdf,SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System,"Accuracy and computational efficiency are the most important metrics to +Visual Inertial Navigation System (VINS). The existing VINS algorithms with +either high accuracy or low computational complexity, are difficult to provide +the high precision localization in resource-constrained devices. To this end, +we propose a novel filter-based VINS framework named SchurVINS, which could +guarantee both high accuracy by building a complete residual model and low +computational complexity with Schur complement. Technically, we first formulate +the full residual model where Gradient, Hessian and observation covariance are +explicitly modeled. 
Then Schur complement is employed to decompose the full +model into ego-motion residual model and landmark residual model. Finally, +Extended Kalman Filter (EKF) update is implemented in these two models with +high efficiency. Experiments on EuRoC and TUM-VI datasets show that our method +notably outperforms state-of-the-art (SOTA) methods in both accuracy and +computational complexity. The experimental code of SchurVINS is available at +https://github.com/bytedance/SchurVINS.",cs.CV,"['cs.CV', 'cs.RO']" +CAGE: Controllable Articulation GEneration,Jiayi Liu · Hou In Ivan Tam · Ali Mahdavi Amiri · Manolis Savva, ,https://arxiv.org/abs/2312.09570,,2312.09570.pdf,CAGE: Controllable Articulation GEneration,"We address the challenge of generating 3D articulated objects in a +controllable fashion. Currently, modeling articulated 3D objects is either +achieved through laborious manual authoring, or using methods from prior work +that are hard to scale and control directly. We leverage the interplay between +part shape, connectivity, and motion using a denoising diffusion-based method +with attention modules designed to extract correlations between part +attributes. Our method takes an object category label and a part connectivity +graph as input and generates an object's geometry and motion parameters. The +generated objects conform to user-specified constraints on the object category, +part shape, and part articulation. Our experiments show that our method +outperforms the state-of-the-art in articulated object generation, producing +more realistic objects while conforming better to user constraints. + Video Summary at: http://youtu.be/cH_rbKbyTpE",cs.CV,['cs.CV'] +Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces,Jiahong Wang · Yinwei DU · Stelian Coros · Bernhard Thomaszewski, ,https://arxiv.org/abs/2404.17620,,2404.17620.pdf,Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces,"We propose a self-supervised approach for learning physics-based subspaces +for real-time simulation. Existing learning-based methods construct subspaces +by approximating pre-defined simulation data in a purely geometric way. +However, this approach tends to produce high-energy configurations, leads to +entangled latent space dimensions, and generalizes poorly beyond the training +set. To overcome these limitations, we propose a self-supervised approach that +directly minimizes the system's mechanical energy during training. We show that +our method leads to learned subspaces that reflect physical equilibrium +constraints, resolve overfitting issues of previous methods, and offer +interpretable latent space parameters.",cs.LG,"['cs.LG', 'cs.CV', 'cs.GR']" +Beyond Average: Individualized Visual Scanpath Prediction,Xianyu Chen · Ming Jiang · Qi Zhao, ,https://arxiv.org/abs/2404.12235,,2404.12235.pdf,Beyond Average: Individualized Visual Scanpath Prediction,"Understanding how attention varies across individuals has significant +scientific and societal impacts. However, existing visual scanpath models treat +attention uniformly, neglecting individual differences. To bridge this gap, +this paper focuses on individualized scanpath prediction (ISP), a new attention +modeling task that aims to accurately predict how different individuals shift +their attention in diverse visual tasks. 
It proposes an ISP method featuring +three novel technical components: (1) an observer encoder to characterize and +integrate an observer's unique attention traits, (2) an observer-centric +feature integration approach that holistically combines visual features, task +guidance, and observer-specific characteristics, and (3) an adaptive fixation +prioritization mechanism that refines scanpath predictions by dynamically +prioritizing semantic feature maps based on individual observers' attention +traits. These novel components allow scanpath models to effectively address the +attention variations across different observers. Our method is generally +applicable to different datasets, model architectures, and visual tasks, +offering a comprehensive tool for transforming general scanpath models into +individualized ones. Comprehensive evaluations using value-based and +ranking-based metrics verify the method's effectiveness and generalizability.",cs.CV,['cs.CV'] +CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow,Chenbin Pan · Burhaneddin Yaman · Senem Velipasalar · Liu Ren, ,https://arxiv.org/abs/2403.08919,,2403.08919.pdf,CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow,"Autonomous driving stands as a pivotal domain in computer vision, shaping the +future of transportation. Within this paradigm, the backbone of the system +plays a crucial role in interpreting the complex environment. However, a +notable challenge has been the loss of clear supervision when it comes to +Bird's Eye View elements. To address this limitation, we introduce +CLIP-BEVFormer, a novel approach that leverages the power of contrastive +learning techniques to enhance the multi-view image-derived BEV backbones with +ground truth information flow. We conduct extensive experiments on the +challenging nuScenes dataset and showcase significant and consistent +improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive +8.5\% and 9.2\% enhancement in terms of NDS and mAP, respectively, over the +previous best BEV model on the 3D object detection task.",cs.CV,['cs.CV'] +"OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition",Jianqiang Wan · Sibo Song · Wenwen Yu · Yuliang Liu · Wenqing Cheng · Fei Huang · Xiang Bai · Cong Yao · Zhibo Yang, ,https://arxiv.org/abs/2403.19128,,2403.19128.pdf,"OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition","Recently, visually-situated text parsing (VsTP) has experienced notable +advancements, driven by the increasing demand for automated document +understanding and the emergence of Generative Large Language Models (LLMs) +capable of processing document-based questions. Various methods have been +proposed to address the challenging problem of VsTP. However, due to the +diversified targets and heterogeneous schemas, previous works usually design +task-specific architectures and objectives for individual tasks, which +inadvertently leads to modal isolation and complex workflow. In this paper, we +propose a unified paradigm for parsing visually-situated text across diverse +scenarios. Specifically, we devise a universal model, called OmniParser, which +can simultaneously handle three typical visually-situated text parsing tasks: +text spotting, key information extraction, and table recognition. 
In +OmniParser, all tasks share the unified encoder-decoder architecture, the +unified objective: point-conditioned text generation, and the unified input & +output representation: prompt & structured sequences. Extensive experiments +demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or +highly competitive performances on 7 datasets for the three visually-situated +text parsing tasks, despite its unified, concise design. The code is available +at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.",cs.CV,['cs.CV'] +ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image,Kyle Sargent · Zizhang Li · Tanmay Shah · Charles Herrmann · Hong-Xing Yu · Yunzhi Zhang · Eric Ryan Chan · Dmitry Lagun · Li Fei-Fei · Deqing Sun · Jiajun Wu,kylesargent.github.io/zeronvs,https://arxiv.org/abs/2310.17994,,2310.17994.pdf,ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image,"We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view +synthesis for in-the-wild scenes. While existing methods are designed for +single objects with masked backgrounds, we propose new techniques to address +challenges introduced by in-the-wild multi-object scenes with complex +backgrounds. Specifically, we train a generative prior on a mixture of data +sources that capture object-centric, indoor, and outdoor scenes. To address +issues from data mixture such as depth-scale ambiguity, we propose a novel +camera conditioning parameterization and normalization scheme. Further, we +observe that Score Distillation Sampling (SDS) tends to truncate the +distribution of complex backgrounds during distillation of 360-degree scenes, +and propose ""SDS anchoring"" to improve the diversity of synthesized novel +views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset +in the zero-shot setting, even outperforming methods specifically trained on +DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark +for single-image novel view synthesis, and demonstrate strong performance in +this setting. Our code and data are at http://kylesargent.github.io/zeronvs/",cs.CV,"['cs.CV', 'cs.GR']" +Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds,Heejoon Moon · Chunghwan Lee · Je Hyeong Hong,https://github.com/PHANTOM0122/Ray-cloud,,https://ieeexplore.ieee.org/abstract/document/10203590,,,,,nan +BigGait: Learning Gait Representation You Want by Large Vision Models,Dingqiang Ye · Chao Fan · Jingzhe Ma · Xiaoming Liu · Shiqi Yu,https://github.com/ShiqiYu/OpenGait,https://arxiv.org/abs/2402.19122,,,BigGait: Learning Gait Representation You Want by Large Vision Models,"Gait recognition stands as one of the most pivotal remote identification +technologies and progressively expands across research and industry +communities. However, existing gait recognition methods heavily rely on +task-specific upstream driven by supervised learning to provide explicit gait +representations like silhouette sequences, which inevitably introduce expensive +annotation costs and potential error accumulation. Escaping from this trend, +this work explores effective gait representations based on the all-purpose +knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a +simple yet efficient gait framework, termed BigGait. 
Specifically, the Gait +Representation Extractor (GRE) within BigGait draws upon design principles from +established gait representations, effectively transforming all-purpose +knowledge into implicit gait representations without requiring third-party +supervision signals. Experiments on CCPG, CAISA-B* and SUSTech1K indicate that +BigGait significantly outperforms the previous methods in both within-domain +and cross-domain tasks in most cases, and provides a more practical paradigm +for learning the next-generation gait representation. Finally, we delve into +prospective challenges and promising directions in LVMs-based gait recognition, +aiming to inspire future work in this emerging topic. The source code is +available at https://github.com/ShiqiYu/OpenGait.",cs.CV,['cs.CV'] +ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding,Le Xue · Ning Yu · Shu Zhang · Artemis Panagopoulou · Junnan Li · Roberto Martín-Martín · Jiajun Wu · Caiming Xiong · Ran Xu · Juan Carlos Niebles · Silvio Savarese, ,https://ar5iv.labs.arxiv.org/html/2305.08275,,2305.08275.pdf,ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding,"Recent advancements in multimodal pre-training have shown promising efficacy +in 3D representation learning by aligning multimodal features across 3D shapes, +their 2D counterparts, and language descriptions. However, the methods used by +existing frameworks to curate such multimodal data, in particular language +descriptions for 3D shapes, are not scalable, and the collected language +descriptions are not diverse. To address this, we introduce ULIP-2, a simple +yet effective tri-modal pre-training framework that leverages large multimodal +models to automatically generate holistic language descriptions for 3D shapes. +It only needs 3D data as input, eliminating the need for any manual 3D +annotations, and is therefore scalable to large datasets. ULIP-2 is also +equipped with scaled-up backbones for better multimodal representation +learning. We conduct experiments on two large-scale 3D datasets, Objaverse and +ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, +and language for training ULIP-2. Experiments show that ULIP-2 demonstrates +substantial benefits in three downstream tasks: zero-shot 3D classification, +standard 3D classification with fine-tuning, and 3D captioning (3D-to-language +generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and +84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN +benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% +with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a +new paradigm for scalable multimodal 3D representation learning without human +annotations and shows significant improvements over existing baselines. The +code and datasets are released at https://github.com/salesforce/ULIP.",cs.CV,['cs.CV'] +On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?,Maxime Zanella · Ismail Ben Ayed,https://github.com/MaxZanella/MTA,https://arxiv.org/abs/2405.02266,,2405.02266.pdf,On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?,"The development of large vision-language models, notably CLIP, has catalyzed +research into effective adaptation techniques, with a particular focus on soft +prompt tuning. 
Conjointly, test-time augmentation, which utilizes multiple +augmented views of a single image to enhance zero-shot generalization, is +emerging as a significant area of interest. This has predominantly directed +research efforts toward test-time prompt tuning. In contrast, we introduce a +robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based +methods without requiring this intensive training procedure. This positions MTA +as an ideal solution for both standalone and API-based applications. +Additionally, our method does not rely on ad hoc rules (e.g., confidence +threshold) used in some previous test-time augmentation techniques to filter +the augmented views. Instead, MTA incorporates a quality assessment variable +for each view directly into its optimization process, termed as the inlierness +score. This score is jointly optimized with a density mode seeking process, +leading to an efficient training- and hyperparameter-free approach. We +extensively benchmark our method on 15 datasets and demonstrate MTA's +superiority and computational efficiency. Deployed easily as plug-and-play +module on top of zero-shot models and state-of-the-art few-shot methods, MTA +shows systematic and consistent improvements.",cs.CV,['cs.CV'] +Context-Guided Spatio-Temporal Video Grounding,Xin Gu · Heng Fan · Yan Huang · Tiejian Luo · Libo Zhang, ,https://arxiv.org/abs/2401.01578,,2401.01578.pdf,Context-Guided Spatio-Temporal Video Grounding,"Spatio-temporal video grounding (or STVG) task aims at locating a +spatio-temporal tube for a specific instance given a text query. Despite +advancements, current methods easily suffer the distractors or heavy object +appearance variations in videos due to insufficient object information from the +text, leading to degradation. Addressing this, we propose a novel framework, +context-guided STVG (CG-STVG), which mines discriminative instance context for +object in videos and applies it as a supplementary guidance for target +localization. The key of CG-STVG lies in two specially designed modules, +including instance context generation (ICG), which focuses on discovering +visual context information (in both appearance and motion) of the instance, and +instance context refinement (ICR), which aims to improve the instance context +from ICG by eliminating irrelevant or even harmful information from the +context. During grounding, ICG, together with ICR, are deployed at each +decoding stage of a Transformer architecture for instance context learning. +Particularly, instance context learned from one decoding stage is fed to the +next stage, and leveraged as a guidance containing rich and discriminative +object feature to enhance the target-awareness in decoding feature, which +conversely benefits generating better new instance context for improving +localization finally. Compared to existing methods, CG-STVG enjoys object +information in text query and guidance from mined instance visual context for +more accurate target localization. In our experiments on three benchmarks, +including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-arts in +m_tIoU and m_vIoU on all of them, showing its efficacy. 
The code will be +released at https://github.com/HengLan/CGSTVG.",cs.CV,['cs.CV'] +GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions,Junjie Wang · Jiemin Fang · Xiaopeng Zhang · Lingxi Xie · Qi Tian, ,https://arxiv.org/abs/2311.16037,,2311.16037.pdf,GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions,"Recently, impressive results have been achieved in 3D scene editing with text +instructions based on a 2D diffusion model. However, current diffusion models +primarily generate images by predicting noise in the latent space, and the +editing is usually applied to the whole image, which makes it challenging to +perform delicate, especially localized, editing for 3D scenes. Inspired by +recent 3D Gaussian splatting, we propose a systematic framework, named +GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text +instructions. Benefiting from the explicit property of 3D Gaussians, we design +a series of techniques to achieve delicate editing. Specifically, we first +extract the region of interest (RoI) corresponding to the text instruction, +aligning it to 3D Gaussians. The Gaussian RoI is further used to control the +editing process. Our framework can achieve more delicate and precise editing of +3D scenes than previous methods while enjoying much faster training speed, i.e. +within 20 minutes on a single V100 GPU, more than twice as fast as +Instruct-NeRF2NeRF (45 minutes -- 2 hours).",cs.CV,"['cs.CV', 'cs.GR']" +GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis,Shunyuan Zheng · Boyao ZHOU · Ruizhi Shao · Boning Liu · Shengping Zhang · Liqiang Nie · Yebin Liu,https://shunyuanzheng.github.io/GPS-Gaussian,https://arxiv.org/abs/2312.02155,,2312.02155.pdf,GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis,"We present a new approach, termed GPS-Gaussian, for synthesizing novel views +of a character in a real-time manner. The proposed method enables 2K-resolution +rendering under a sparse-view camera setting. Unlike the original Gaussian +Splatting or neural implicit rendering methods that necessitate per-subject +optimizations, we introduce Gaussian parameter maps defined on the source views +and regress directly Gaussian Splatting properties for instant novel view +synthesis without any fine-tuning or optimization. To this end, we train our +Gaussian parameter regression module on a large amount of human scan data, +jointly with a depth estimation module to lift 2D parameter maps to 3D space. +The proposed framework is fully differentiable and experiments on several +datasets demonstrate that our method outperforms state-of-the-art methods while +achieving an exceeding rendering speed.",cs.CV,['cs.CV'] +OpenStreetView-5M: The Many Roads to Global Visual Geolocation,Guillaume Astruc · Nicolas Dufour · Ioannis Siglidis · Constantin Aronssohn · Nacim Bouia · Stephanie Fu · Romain Loiseau · Van Nguyen Nguyen · Charles Raude · Elliot Vincent · Lintao XU · Hongyu Zhou · Loic Landrieu,https://imagine.enpc.fr/~ioannis.siglidis/osv5m/,https://arxiv.org/abs/2404.18873v1,,2404.18873v1.pdf,OpenStreetView-5M: The Many Roads to Global Visual Geolocation,"Determining the location of an image anywhere on Earth is a complex visual +task, which makes it particularly relevant for evaluating computer vision +algorithms. Yet, the absence of standard, large-scale, open-access datasets +with reliably localizable images has limited its potential. 
To address this +issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset +comprising over 5.1 million geo-referenced street view images, covering 225 +countries and territories. In contrast to existing benchmarks, we enforce a +strict train/test separation, allowing us to evaluate the relevance of learned +geographical features beyond mere memorization. To demonstrate the utility of +our dataset, we conduct an extensive benchmark of various state-of-the-art +image encoders, spatial representations, and training strategies. All +associated codes and models can be found at https://github.com/gastruc/osv5m.",cs.CV,"['cs.CV', 'cs.AI']" +Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion,Zuoyue Li · Zhenqiang Li · Zhaopeng Cui · Marc Pollefeys · Martin R. Oswald, ,https://arxiv.org/abs/2401.10786,,2401.10786.pdf,Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion,"Directly generating scenes from satellite imagery offers exciting +possibilities for integration into applications like games and map services. +However, challenges arise from significant view changes and scene scale. +Previous efforts mainly focused on image or video generation, lacking +exploration into the adaptability of scene generation for arbitrary views. +Existing 3D generation works either operate at the object level or are +difficult to utilize the geometry obtained from satellite imagery. To overcome +these limitations, we propose a novel architecture for direct 3D scene +generation by introducing diffusion models into 3D sparse representations and +combining them with neural rendering techniques. Specifically, our approach +generates texture colors at the point level for a given geometry using a 3D +diffusion model first, which is then transformed into a scene representation in +a feed-forward manner. The representation can be utilized to render arbitrary +views which would excel in both single-frame quality and inter-frame +consistency. Experiments in two city-scale datasets show that our model +demonstrates proficiency in generating photo-realistic street-view image +sequences and cross-view urban scenes from satellite imagery.",cs.CV,['cs.CV'] +CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs,Haocheng Yuan · Jing Xu · Hao Pan · Adrien Bousseau · Niloy J. Mitra · Changjian Li,https://enigma-li.github.io/CADTalk/,https://arxiv.org/abs/2311.16703,,2311.16703.pdf,CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs,"CAD programs are a popular way to compactly encode shapes as a sequence of +operations that are easy to parametrically modify. However, without sufficient +semantic comments and structure, such programs can be challenging to +understand, let alone modify. We introduce the problem of semantic commenting +CAD programs, wherein the goal is to segment the input program into code blocks +corresponding to semantically meaningful shape parts and assign a semantic +label to each block. We solve the problem by combining program parsing with +visual-semantic analysis afforded by recent advances in foundational language +and vision models. Specifically, by executing the input programs, we create +shapes, which we use to generate conditional photorealistic images to make use +of semantic annotators for such images. We then distill the information across +the images and link back to the original programs to semantically comment on +them. 
Additionally, we collected and annotated a benchmark dataset, CADTalk, +consisting of 5,288 machine-made programs and 45 human-made programs with +ground truth semantic comments. We extensively evaluated our approach, compared +it to a GPT-based baseline, and an open-set shape segmentation baseline, and +reported an 83.24% accuracy on the new CADTalk dataset. Code and data: +https://enigma-li.github.io/CADTalk/.",cs.CV,"['cs.CV', 'cs.GR']" +MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning,Ahmed Agiza · Marina Neseem · Sherief Reda,https://github.com/scale-lab/MTLoRA,https://arxiv.org/abs/2403.20320,,2403.20320.pdf,MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning,"Adapting models pre-trained on large-scale datasets to a variety of +downstream tasks is a common strategy in deep learning. Consequently, +parameter-efficient fine-tuning methods have emerged as a promising way to +adapt pre-trained models to different tasks while training only a minimal +number of parameters. While most of these methods are designed for single-task +adaptation, parameter-efficient training in Multi-Task Learning (MTL) +architectures is still unexplored. In this paper, we introduce MTLoRA, a novel +framework for parameter-efficient training of MTL models. MTLoRA employs +Task-Agnostic and Task-Specific Low-Rank Adaptation modules, which effectively +disentangle the parameter space in MTL fine-tuning, thereby enabling the model +to adeptly handle both task specialization and interaction within MTL contexts. +We applied MTLoRA to hierarchical-transformer-based MTL architectures, adapting +them to multiple downstream dense prediction tasks. Our extensive experiments +on the PASCAL dataset show that MTLoRA achieves higher accuracy on downstream +tasks compared to fully fine-tuning the MTL model while reducing the number of +trainable parameters by 3.6x. Furthermore, MTLoRA establishes a Pareto-optimal +trade-off between the number of trainable parameters and the accuracy of the +downstream tasks, outperforming current state-of-the-art parameter-efficient +training methods in both accuracy and efficiency. Our code is publicly +available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning,Shilong Ou · Zhe Xue · Yawen Li · Meiyu Liang · Yuanqiang Cai · junjiang wu, ,https://arxiv.org/abs/2404.17340,,2404.17340.pdf,Masked Two-channel Decoupling Framework for Incomplete Multi-view Weak Multi-label Learning,"Multi-view learning has become a popular research topic in recent years, but +research on the cross-application of classic multi-label classification and +multi-view learning is still in its early stages. In this paper, we focus on +the complex yet highly realistic task of incomplete multi-view weak multi-label +learning and propose a masked two-channel decoupling framework based on deep +neural networks to solve this problem. The core innovation of our method lies +in decoupling the single-channel view-level representation, which is common in +deep multi-view learning methods, into a shared representation and a +view-proprietary representation. We also design a cross-channel contrastive +loss to enhance the semantic property of the two channels. Additionally, we +exploit supervised information to design a label-guided graph regularization +loss, helping the extracted embedding features preserve the geometric structure +among samples. 
Inspired by the success of masking mechanisms in image and text +analysis, we develop a random fragment masking strategy for vector features to +improve the learning ability of encoders. Finally, it is important to emphasize +that our model is fully adaptable to arbitrary view and label absences while +also performing well on the ideal full data. We have conducted sufficient and +convincing experiments to confirm the effectiveness and advancement of our +model.",cs.CV,['cs.CV'] +FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding,Jinglin Xu · Guohao Zhao · Sibo Yin · Wenhao Zhou · Yuxin Peng, ,,,,,,,nan +Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation,Mohammad Amin Shabani · Zhaowen Wang · Difan Liu · Nanxuan Zhao · Jimei Yang · Yasutaka Furukawa,https://aminshabani.github.io/visual_layout_composer/index.html,https://web3.arxiv.org/abs/2402.04754,,2402.04754.pdf,Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints,"Controllable layout generation refers to the process of creating a plausible +visual arrangement of elements within a graphic design (e.g., document and web +designs) with constraints representing design intentions. Although recent +diffusion-based models have achieved state-of-the-art FID scores, they tend to +exhibit more pronounced misalignment compared to earlier transformer-based +models. In this work, we propose the $\textbf{LA}$yout $\textbf{C}$onstraint +diffusion mod$\textbf{E}$l (LACE), a unified model to handle a broad range of +layout generation tasks, such as arranging elements with specified attributes +and refining or completing a coarse layout design. The model is based on +continuous diffusion models. Compared with existing methods that use discrete +diffusion models, continuous state-space design can enable the incorporation of +differentiable aesthetic constraint functions in training. For conditional +generation, we introduce conditions via masked input. Extensive experiment +results show that LACE produces high-quality layouts and outperforms existing +state-of-the-art baselines.",cs.CV,"['cs.CV', 'cs.LG']" +AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor,Sudong Cai,https://github.com/SudongCAI/AdaShift,,https://www.nature.com/articles/s41598-024-60598-2,,,,,nan +Generalized Event Cameras,Varun Sundar · Matthew Dutson · Andrei Ardelean · Claudio Bruschini · Edoardo Charbon · Mohit Gupta,https://wisionlab.com/project/generalized-event-cameras/,,https://aim.autm.net/public/project/73780/,,,,,nan +DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency,Heng Guo · Jieji Ren · Feishi Wang · Boxin Shi · Mingjun Ren · Yasuyuki Matsushita, ,,,,,,,nan +Instantaneous Perception of Moving Objects in 3D,Di Liu · Bingbing Zhuang · Dimitris N. Metaxas · Manmohan Chandraker, ,https://arxiv.org/abs/2405.02781,,2405.02781.pdf,Instantaneous Perception of Moving Objects in 3D,"The perception of 3D motion of surrounding traffic participants is crucial +for driving safety. While existing works primarily focus on general large +motions, we contend that the instantaneous detection and quantification of +subtle motions is equally important as they indicate the nuances in driving +behavior that may be safety critical, such as behaviors near a stop sign of +parking positions. 
We delve into this under-explored task, examining its unique +challenges and developing our solution, accompanied by a carefully designed +benchmark. Specifically, due to the lack of correspondences between consecutive +frames of sparse Lidar point clouds, static objects might appear to be moving - +the so-called swimming effect. This intertwines with the true object motion, +thereby posing ambiguity in accurate estimation, especially for subtle motions. +To address this, we propose to leverage local occupancy completion of object +point clouds to densify the shape cue, and mitigate the impact of swimming +artifacts. The occupancy completion is learned in an end-to-end fashion +together with the detection of moving objects and the estimation of their +motion, instantaneously as soon as objects start to move. Extensive experiments +demonstrate superior performance compared to standard 3D motion estimation +approaches, particularly highlighting our method's specialized treatment of +subtle motions.",cs.CV,['cs.CV'] +eTraM: Event-based Traffic Monitoring Dataset,Aayush Atul Verma · Bharatesh Chakravarthi · Arpitsinh Vaghela · Hua Wei · 'YZ' Yezhou Yang,https://eventbasedvision.github.io/eTraM/,https://arxiv.org/abs/2403.19976,,2403.19976.pdf,eTraM: Event-based Traffic Monitoring Dataset,"Event cameras, with their high temporal and dynamic range and minimal memory +usage, have found applications in various fields. However, their potential in +static traffic monitoring remains largely unexplored. To facilitate this +exploration, we present eTraM - a first-of-its-kind, fully event-based traffic +monitoring dataset. eTraM offers 10 hr of data from different traffic scenarios +in various lighting and weather conditions, providing a comprehensive overview +of real-world situations. Providing 2M bounding box annotations, it covers +eight distinct classes of traffic participants, ranging from vehicles to +pedestrians and micro-mobility. eTraM's utility has been assessed using +state-of-the-art methods for traffic participant detection, including RVT, RED, +and YOLOv8. We quantitatively evaluate the ability of event-based models to +generalize on nighttime and unseen scenes. Our findings substantiate the +compelling potential of leveraging event cameras for traffic monitoring, +opening new avenues for research and application. eTraM is available at +https://eventbasedvision.github.io/eTraM",cs.CV,['cs.CV'] +Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation,Hang Li · Chengzhi Shen · Philip H.S. Torr · Volker Tresp · Jindong Gu, ,https://arxiv.org/abs/2311.17216,,2311.17216.pdf,Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation,"Diffusion-based models have gained significant popularity for text-to-image +generation due to their exceptional image-generation capabilities. A risk with +these models is the potential generation of inappropriate content, such as +biased or harmful images. However, the underlying reasons for generating such +undesired content from the perspective of the diffusion model's internal +representation remain unclear. Previous work interprets vectors in an +interpretable latent space of diffusion models as semantic concepts. However, +existing approaches cannot discover directions for arbitrary concepts, such as +those related to inappropriate concepts. In this work, we propose a novel +self-supervised approach to find interpretable latent directions for a given +concept. 
With the discovered vectors, we further propose a simple approach to +mitigate inappropriate generation. Extensive experiments have been conducted to +verify the effectiveness of our mitigation approach, namely, for fair +generation, safe generation, and responsible text-enhancing generation. Project +page: \url{https://interpretdiffusion.github.io}.",cs.CV,['cs.CV'] +GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction,Xiao Chen · Quanyi Li · Tai Wang · Tianfan Xue · Jiangmiao Pang, ,https://arxiv.org/abs/2402.16174,,2402.16174.pdf,GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction,"While recent advances in neural radiance field enable realistic digitization +for large-scale scenes, the image-capturing process is still time-consuming and +labor-intensive. Previous works attempt to automate this process using the +Next-Best-View (NBV) policy for active 3D reconstruction. However, the existing +NBV policies heavily rely on hand-crafted criteria, limited action space, or +per-scene optimized representations. These constraints limit their +cross-dataset generalizability. To overcome them, we propose GenNBV, an +end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning +(RL)-based framework and extends typical limited action space to 5D free space. +It empowers our agent drone to scan from any viewpoint, and even interact with +unseen geometries during training. To boost the cross-dataset generalizability, +we also propose a novel multi-source state embedding, including geometric, +semantic, and action representations. We establish a benchmark using the Isaac +Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV +policy. Experiments demonstrate that our policy achieves a 98.26% and 97.12% +coverage ratio on unseen building-scale objects from these datasets, +respectively, outperforming prior solutions.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation,Agneet Chatterjee · Tejas Gokhale · Chitta Baral · 'YZ' Yezhou Yang,https://agneetchatterjee.com/robustness_depth_lang/,https://arxiv.org/abs/2404.08540,,2404.08540.pdf,On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation,"Recent advances in monocular depth estimation have been made by incorporating +natural language as additional guidance. Although yielding impressive results, +the impact of the language prior, particularly in terms of generalization and +robustness, remains unexplored. In this paper, we address this gap by +quantifying the impact of this prior and introduce methods to benchmark its +effectiveness across various settings. We generate ""low-level"" sentences that +convey object-centric, three-dimensional spatial relationships, incorporate +them as additional language priors and evaluate their downstream impact on +depth estimation. Our key finding is that current language-guided depth +estimators perform optimally only with scene-level descriptions and +counter-intuitively fare worse with low level descriptions. Despite leveraging +additional data, these methods are not robust to directed adversarial attacks +and decline in performance with an increase in distribution shift. Finally, to +provide a foundation for future research, we identify points of failures and +offer insights to better understand these shortcomings. 
With an increasing +number of methods using language for depth estimation, our findings highlight +the opportunities and pitfalls that require careful consideration for effective +deployment in real-world settings",cs.CV,['cs.CV'] +Towards a Perceptual Evaluation Framework for Lighting Estimation,Justine Giroux · Mohammad Reza Karimi Dastjerdi · Yannick Hold-Geoffroy · Javier Vazquez-Corral · Jean-François Lalonde, ,https://arxiv.org/abs/2312.04334,,2312.04334.pdf,Towards a Perceptual Evaluation Framework for Lighting Estimation,"Progress in lighting estimation is tracked by computing existing image +quality assessment (IQA) metrics on images from standard datasets. While this +may appear to be a reasonable approach, we demonstrate that doing so does not +correlate to human preference when the estimated lighting is used to relight a +virtual scene into a real photograph. To study this, we design a controlled +psychophysical experiment where human observers must choose their preference +amongst rendered scenes lit using a set of lighting estimation algorithms +selected from the recent literature, and use it to analyse how these algorithms +perform according to human perception. Then, we demonstrate that none of the +most popular IQA metrics from the literature, taken individually, correctly +represent human perception. Finally, we show that by learning a combination of +existing IQA metrics, we can more accurately represent human preference. This +provides a new perceptual framework to help evaluate future lighting estimation +algorithms.",cs.CV,['cs.CV'] +HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video,Zicong Fan · Maria Parelli · Maria Kadoglou · Xu Chen · Muhammed Kocabas · Michael J. Black · Otmar Hilliges,https://zc-alexfan.github.io/hold,https://arxiv.org/abs/2311.18448v1,,2311.18448v1.pdf,HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video,"Since humans interact with diverse objects every day, the holistic 3D capture +of these interactions is important to understand and model human behaviour. +However, most existing methods for hand-object reconstruction from RGB either +assume pre-scanned object templates or heavily rely on limited 3D hand-object +data, restricting their ability to scale and generalize to more unconstrained +interaction settings. To this end, we introduce HOLD -- the first +category-agnostic method that reconstructs an articulated hand and object +jointly from a monocular interaction video. We develop a compositional +articulated implicit model that can reconstruct disentangled 3D hand and object +from 2D images. We also further incorporate hand-object constraints to improve +hand-object poses and consequently the reconstruction quality. Our method does +not rely on 3D hand-object annotations while outperforming fully-supervised +baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we +qualitatively show its robustness in reconstructing from in-the-wild videos. +Code: https://github.com/zc-alexfan/hold",cs.CV,['cs.CV'] +Language-driven All-in-one Adverse Weather Removal,Hao Yang · Liyuan Pan · Yan Yang · Wei Liang, ,https://arxiv.org/abs/2312.01381,,2312.01381.pdf,Language-driven All-in-one Adverse Weather Removal,"All-in-one (AiO) frameworks restore various adverse weather degradations with +a single set of networks jointly. 
To handle various weather conditions, an AiO +framework is expected to adaptively learn weather-specific knowledge for +different degradations and shared knowledge for common patterns. However, +existing methods: 1) rely on extra supervision signals, which are usually +unknown in real-world applications; 2) employ fixed network structures, which +restrict the diversity of weather-specific knowledge. In this paper, we propose +a Language-driven Restoration framework (LDR) to alleviate the aforementioned +issues. First, we leverage the power of pre-trained vision-language (PVL) +models to enrich the diversity of weather-specific knowledge by reasoning about +the occurrence, type, and severity of degradation, generating description-based +degradation priors. Then, with the guidance of degradation prior, we sparsely +select restoration experts from a candidate list dynamically based on a +Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the +weather-specific and shared knowledge to handle various weather conditions +(e.g., unknown or mixed weather). Experiments on extensive restoration +scenarios show our superior performance (see Fig. 1). The source code will be +made available.",cs.CV,['cs.CV'] +Gaussian Splatting SLAM,Hidenobu Matsuki · Riku Murai · Paul Kelly · Andrew J. Davison,https://rmurai.co.uk/projects/GaussianSplattingSLAM/,https://arxiv.org/abs/2312.06741,,2312.06741.pdf,Gaussian Splatting SLAM,"We present the first application of 3D Gaussian Splatting in monocular SLAM, +the most fundamental but the hardest setup for Visual SLAM. Our method, which +runs live at 3fps, utilises Gaussians as the only 3D representation, unifying +the required representation for accurate, efficient tracking, mapping, and +high-quality rendering. Designed for challenging monocular settings, our +approach is seamlessly extendable to RGB-D SLAM when an external depth sensor +is available. Several innovations are required to continuously reconstruct 3D +scenes with high fidelity from a live camera. First, to move beyond the +original 3DGS algorithm, which requires accurate poses from an offline +Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using +direct optimisation against the 3D Gaussians, and show that this enables fast +and robust tracking with a wide basin of convergence. Second, by utilising the +explicit nature of the Gaussians, we introduce geometric verification and +regularisation to handle the ambiguities occurring in incremental 3D dense +reconstruction. Finally, we introduce a full SLAM system which not only +achieves state-of-the-art results in novel view synthesis and trajectory +estimation but also reconstruction of tiny and even transparent objects.",cs.CV,"['cs.CV', 'cs.RO']" +Backdoor Defense via Test-Time Detecting and Repairing,Jiyang Guan · Jian Liang · Ran He, ,https://arxiv.org/abs/2308.06107,,2308.06107.pdf,Test-Time Backdoor Defense via Detecting and Repairing,"Deep neural networks have played a crucial part in many critical domains, +such as autonomous driving, face recognition, and medical diagnosis. However, +deep neural networks are facing security threats from backdoor attacks and can +be manipulated into attacker-decided behaviors by the backdoor attacker. To +defend the backdoor, prior research has focused on using clean data to remove +backdoor attacks before model deployment. 
In this paper, we investigate the +possibility of defending against backdoor attacks at test time by utilizing +partially poisoned data to remove the backdoor from the model. To address the +problem, a two-stage method Test-Time Backdoor Defense (TTBD) is proposed. In +the first stage, we propose a backdoor sample detection method DDP to identify +poisoned samples from a batch of mixed, partially poisoned samples. Once the +poisoned samples are detected, we employ Shapley estimation to calculate the +contribution of each neuron's significance in the network, locate the poisoned +neurons, and prune them to remove backdoor in the models. Our experiments +demonstrate that TTBD removes the backdoor successfully with only a batch of +partially poisoned data across different model architectures and datasets +against different types of backdoor attacks.",cs.CR,['cs.CR'] +XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies,Xuanchi Ren · Jiahui Huang · Xiaohui Zeng · Ken Museth · Sanja Fidler · Francis Williams, ,https://arxiv.org/abs/2312.03806,,2312.03806.pdf,XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies,"We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for +high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can +generate millions of voxels with a finest effective resolution of up to +$1024^3$ in a feed-forward fashion without time-consuming test-time +optimization. To achieve this, we employ a hierarchical voxel latent diffusion +model which generates progressively higher resolution grids in a coarse-to-fine +manner using a custom framework built on the highly efficient VDB data +structure. Apart from generating high-resolution objects, we demonstrate the +effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m +with a voxel size as small as 10cm. We observe clear qualitative and +quantitative improvements over past approaches. In addition to unconditional +generation, we show that our model can be used to solve a variety of tasks such +as user-guided editing, scene completion from a single scan, and text-to-3D. +More results and details can be found at +https://research.nvidia.com/labs/toronto-ai/xcube/.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +MultiPhys: Multi-Person Physics-aware 3D Motion Estimation,Nicolás Ugrinovic · Boxiao Pan · Georgios Pavlakos · Despoina Paschalidou · Bokui Shen · Jordi Sanchez-Riera · Francesc Moreno-Noguer · Leonidas Guibas, ,https://arxiv.org/abs/2404.11987,,2404.11987.pdf,MultiPhys: Multi-Person Physics-aware 3D Motion Estimation,"We introduce MultiPhys, a method designed for recovering multi-person motion +from monocular videos. Our focus lies in capturing coherent spatial placement +between pairs of individuals across varying degrees of engagement. MultiPhys, +being physically aware, exhibits robustness to jittering and occlusions, and +effectively eliminates penetration issues between the two individuals. We +devise a pipeline in which the motion estimated by a kinematic-based method is +fed into a physics simulator in an autoregressive manner. We introduce distinct +components that enable our model to harness the simulator's properties without +compromising the accuracy of the kinematic estimates. This results in final +motion estimates that are both kinematically coherent and physically compliant. 
+Extensive evaluations on three challenging datasets characterized by +substantial inter-person interaction show that our method significantly reduces +errors associated with penetration and foot skating, while performing +competitively with the state-of-the-art on motion accuracy and smoothness. +Results and code can be found on our project page +(http://www.iri.upc.edu/people/nugrinovic/multiphys/).",cs.CV,['cs.CV'] +Implicit Motion Function,Yue Gao · Jiahao Li · Lei Chu · Yan Lu, ,,https://ieeexplore.ieee.org/document/10378136/citations?tabFilter=papers,,,,,nan +Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names,Yapeng Li · Yong Luo · Zengmao Wang · Bo Du, ,,https://ieeexplore.ieee.org/document/10283906,,,,,nan +Unsupervised 3D Structure Inference from Category-Specific Image Collections,Weikang Wang · Dongliang Cao · Florian Bernard,https://wei-kang-wang.github.io/unsuper3Dstructure/,,,,,,,nan +Text-Driven Image Editing via Learnable Regions,Yuanze Lin · Yi-Wen Chen · Yi-Hsuan Tsai · Lu Jiang · Ming-Hsuan Yang,https://yuanze-lin.me/LearnableRegions_page/,https://arxiv.org/abs/2311.16432,,2311.16432.pdf,Text-Driven Image Editing via Learnable Regions,"Language has emerged as a natural interface for image editing. In this paper, +we introduce a method for region-based image editing driven by textual prompts, +without the need for user-provided masks or sketches. Specifically, our +approach leverages an existing pre-trained text-to-image model and introduces a +bounding box generator to identify the editing regions that are aligned with +the textual prompts. We show that this simple approach enables flexible editing +that is compatible with current image generation models, and is able to handle +complex prompts featuring multiple objects, complex sentences, or lengthy +paragraphs. We conduct an extensive user study to compare our method against +state-of-the-art methods. The experiments demonstrate the competitive +performance of our method in manipulating images with high fidelity and realism +that correspond to the provided language descriptions. Our project webpage can +be found at: https://yuanze-lin.me/LearnableRegions_page.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Balancing Act: Distribution-Guided Debiasing in Diffusion Models,Rishubh Parihar · Abhijnya Bhat · Abhipsa Basu · Saswat Mallick · Jogendra Kundu Kundu · R. Venkatesh Babu, ,https://arxiv.org/abs/2402.18206,,2402.18206.pdf,Balancing Act: Distribution-Guided Debiasing in Diffusion Models,"Diffusion Models (DMs) have emerged as powerful generative models with +unprecedented image generation capability. These models are widely used for +data augmentation and creative applications. However, DMs reflect the biases +present in the training datasets. This is especially concerning in the context +of faces, where the DM prefers one demographic subgroup vs others (eg. female +vs male). In this work, we present a method for debiasing DMs without relying +on additional data or model retraining. Specifically, we propose Distribution +Guidance, which enforces the generated images to follow the prescribed +attribute distribution. To realize this, we build on the key insight that the +latent features of denoising UNet hold rich demographic semantics, and the same +can be leveraged to guide debiased generation. We train Attribute Distribution +Predictor (ADP) - a small mlp that maps the latent features to the distribution +of attributes. 
ADP is trained with pseudo labels generated from existing +attribute classifiers. The proposed Distribution Guidance with ADP enables us +to do fair generation. Our method reduces bias across single/multiple +attributes and outperforms the baseline by a significant margin for +unconditional and text-conditional diffusion models. Further, we present a +downstream task of training a fair attribute classifier by rebalancing the +training set with our generated data.",cs.CV,['cs.CV'] +Close Imitation of Expert Retouching for Black-and-White Photography,Seunghyun Shin · Jisu Shin · Jihwan Bae · Inwook Shim · Hae-Gon Jeon,https://github.com/seunghyuns98/Decolorization,,https://retouchinglabs.com/retouching-black-and-white-photos/,,,,,nan +Generative Image Dynamics,Zhengqi Li · Richard Tucker · Noah Snavely · Aleksander Holynski, ,https://arxiv.org/abs/2309.07906,,2309.07906.pdf,Generative Image Dynamics,"We present an approach to modeling an image-space prior on scene motion. Our +prior is learned from a collection of motion trajectories extracted from real +video sequences depicting natural, oscillatory dynamics such as trees, flowers, +candles, and clothes swaying in the wind. We model this dense, long-term motion +prior in the Fourier domain:given a single image, our trained model uses a +frequency-coordinated diffusion sampling process to predict a spectral volume, +which can be converted into a motion texture that spans an entire video. Along +with an image-based rendering module, these trajectories can be used for a +number of downstream applications, such as turning still images into seamlessly +looping videos, or allowing users to realistically interact with objects in +real pictures by interpreting the spectral volumes as image-space modal bases, +which approximate object dynamics.",cs.CV,['cs.CV'] +RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control,xiang deng · Zerong Zheng · Yuxiang Zhang · Jingxiang Sun · Chao Xu · Xiaodong Yang · Lizhen Wang · Yebin Liu, ,https://arxiv.org/html/2303.10275v2,,2303.10275v2.pdf,MoRF: Mobile Realistic Fullbody Avatars from a Monocular Video,"We present a system to create Mobile Realistic Fullbody (MoRF) avatars. MoRF +avatars are rendered in real-time on mobile devices, learned from monocular +videos, and have high realism. We use SMPL-X as a proxy geometry and render it +with DNR (neural texture and image-2-image network). We improve on prior work, +by overfitting per-frame warping fields in the neural texture space, allowing +to better align the training signal between different frames. We also refine +SMPL-X mesh fitting procedure to improve the overall avatar quality. In the +comparisons to other monocular video-based avatar systems, MoRF avatars achieve +higher image sharpness and temporal consistency. Participants of our user study +also preferred avatars generated by MoRF.",cs.CV,['cs.CV'] +SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation,Zhixuan Liu · Peter Schaldenbrand · Beverley-Claire Okogwu · Wenxuan Peng · Youngsik Yun · Andrew Hundt · Jihie Kim · Jean Oh,ariannaliu.github.io/SCoFT/,https://arxiv.org/abs/2401.08053,,2401.08053.pdf,SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation,"Accurate representation in media is known to improve the well-being of the +people who consume it. Generative image models trained on large web-crawled +datasets such as LAION are known to produce images with harmful stereotypes and +misrepresentations of cultures. 
We improve inclusive representation in +generated images by (1) engaging with communities to collect a culturally +representative dataset that we call the Cross-Cultural Understanding Benchmark +(CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method +that leverages the model's known biases to self-improve. SCoFT is designed to +prevent overfitting on small datasets, encode only high-level information from +the data, and shift the generated distribution away from misrepresentations +encoded in a pretrained model. Our user study conducted on 51 participants from +5 different countries based on their self-selected national cultural +affiliation shows that fine-tuning on CCUB consistently generates images with +higher cultural relevance and fewer stereotypes when compared to the Stable +Diffusion baseline, which is further improved with our SCoFT technique.",cs.CV,['cs.CV'] +Rendering Every Pixel for High-Fidelity Geometry in 3D GANs,Alex Trevithick · Matthew Chan · Towaki Takikawa · Umar Iqbal · Shalini De Mello · Manmohan Chandraker · Ravi Ramamoorthi · Koki Nagano, ,https://arxiv.org/abs/2401.02411,,2401.02411.pdf,What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs,"3D-aware Generative Adversarial Networks (GANs) have shown remarkable +progress in learning to generate multi-view-consistent images and 3D geometries +of scenes from collections of 2D images via neural volume rendering. Yet, the +significant memory and computational costs of dense sampling in volume +rendering have forced 3D GANs to adopt patch-based training or employ +low-resolution rendering with post-processing 2D super resolution, which +sacrifices multiview consistency and the quality of resolved geometry. +Consequently, 3D GANs have not yet been able to fully resolve the rich 3D +geometry present in 2D images. In this work, we propose techniques to scale +neural volume rendering to the much higher resolution of native 2D images, +thereby resolving fine-grained 3D geometry with unprecedented detail. Our +approach employs learning-based samplers for accelerating neural rendering for +3D GAN training using up to 5 times fewer depth samples. This enables us to +explicitly ""render every pixel"" of the full-resolution image during training +and inference without post-processing superresolution in 2D. Together with our +strategy to learn high-quality surface geometry, our method synthesizes +high-resolution 3D geometry and strictly view-consistent images while +maintaining image quality on par with baselines relying on post-processing +super resolution. We demonstrate state-of-the-art 3D gemetric quality on FFHQ +and AFHQ, setting a new standard for unsupervised learning of 3D shapes in 3D +GANs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +An Interactive Navigation Method with Effect-oriented Affordance,Xiaohan Wang · Yuehu LIU · Xinhang Song · Yuyi Liu · Sixian Zhang · Shuqiang Jiang, ,https://arxiv.org/abs/2310.08873,,2310.08873.pdf,Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models,"This paper proposes an interactive navigation framework by using large +language and vision-language models, allowing robots to navigate in +environments with traversable obstacles. We utilize the large language model +(GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an +action-aware costmap to perform effective path planning without fine-tuning. 
+With the large models, we can achieve an end-to-end system from textual +instructions like ""Can you pass through the curtains to deliver medicines to +me?"", to bounding boxes (e.g., curtains) with action-aware attributes. They can +be used to segment LiDAR point clouds into two parts: traversable and +untraversable parts, and then an action-aware costmap is constructed for +generating a feasible path. The pre-trained large models have great +generalization ability and do not require additional annotated data for +training, allowing fast deployment in the interactive navigation tasks. We +choose to use multiple traversable objects such as curtains and grasses for +verification by instructing the robot to traverse them. Besides, traversing +curtains in a medical scenario was tested. All experimental results +demonstrated the proposed framework's effectiveness and adaptability to diverse +environments.",cs.RO,"['cs.RO', 'cs.AI']" +Communication-Efficient Federated Learning with Accelerated Client Gradient,Geeho Kim · Jinkyu Kim · Bohyung Han, ,,https://openreview.net/forum?id=qwymfs6cKe,,,,,nan +InceptionNeXt: When Inception Meets ConvNeXt,Weihao Yu · Pan Zhou · Shuicheng Yan · Xinchao Wang,https://github.com/sail-sg/inceptionnext,,https://dblp.org/rec/journals/corr/abs-2303-16900,,,,,nan +MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation,Petru-Daniel Tudosiu · Yongxin Yang · Shifeng Zhang · Fei Chen · Steven McDonagh · Gerasimos Lampouras · Ignacio Iacobacci · Sarah Parisot,https://mulan-dataset.github.io/,https://arxiv.org/abs/2404.02790,,2404.02790.pdf,MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation,"Text-to-image generation has achieved astonishing results, yet precise +spatial controllability and prompt fidelity remain highly challenging. This +limitation is typically addressed through cumbersome prompt engineering, scene +layout conditioning, or image editing techniques which often require hand drawn +masks. Nonetheless, pre-existing works struggle to take advantage of the +natural instance-level compositionality of scenes due to the typically flat +nature of rasterized RGB output images. Towards adressing this challenge, we +introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of +RGB images as multilayer, instance-wise RGBA decompositions, and over 100K +instance images. To build MuLAn, we developed a training free pipeline which +decomposes a monocular RGB image into a stack of RGBA layers comprising of +background and isolated instances. We achieve this through the use of +pretrained general-purpose models, and by developing three modules: image +decomposition for instance discovery and extraction, instance completion to +reconstruct occluded areas, and image re-assembly. We use our pipeline to +create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image +decompositions in terms of style, composition and complexity. With MuLAn, we +provide the first photorealistic resource providing instance decomposition and +occlusion information for high quality images, opening up new avenues for +text-to-image generative AI research. With this, we aim to encourage the +development of novel generation and editing technology, in particular +layer-wise solutions. 
MuLAn data resources are available at +https://MuLAn-dataset.github.io/.",cs.CV,['cs.CV'] +Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning,Hao Jiang · Bingfeng Zhou · Yadong Mu, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ipr2.12998,,,,,nan +On the Scalability of Diffusion-based Text-to-Image Generation,Hao Li · Yang Zou · Ying Wang · Orchid Majumder · Yusheng Xie · R. Manmatha · Ashwin Swaminathan · Zhuowen Tu · Stefano Ermon · Stefano Soatto, ,https://arxiv.org/abs/2404.02883,,2404.02883.pdf,On the Scalability of Diffusion-based Text-to-Image Generation,"Scaling up model and data size has been quite successful for the evolution of +LLMs. However, the scaling law for the diffusion based text-to-image (T2I) +models is not fully explored. It is also unclear how to efficiently scale the +model for better performance at reduced cost. The different training settings +and expensive training cost make a fair model comparison extremely difficult. +In this work, we empirically study the scaling properties of diffusion based +T2I models by performing extensive and rigours ablations on scaling both +denoising backbones and training set, including training scaled UNet and +Transformer variants ranging from 0.4B to 4B parameters on datasets upto 600M +images. For model scaling, we find the location and amount of cross attention +distinguishes the performance of existing UNet designs. And increasing the +transformer blocks is more parameter-efficient for improving text-image +alignment than increasing channel numbers. We then identify an efficient UNet +variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data +scaling side, we show the quality and diversity of the training set matters +more than simply dataset size. Increasing caption density and diversity +improves text-image alignment performance and the learning efficiency. Finally, +we provide scaling functions to predict the text-image alignment performance as +functions of the scale of model size, compute and dataset size.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping,Junfeng Cheng · Tania Stathaki,https://github.com/J-F-Cheng/G-FARS-3DPartGrouping,https://arxiv.org/abs/2405.06828,,2405.06828.pdf,G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping,"This paper proposes a novel task named ""3D part grouping"". Suppose there is a +mixed set containing scattered parts from various shapes. This task requires +algorithms to find out every possible combination among all the parts. To +address this challenge, we propose the so called Gradient Field-based +Auto-Regressive Sampling framework (G-FARS) tailored specifically for the 3D +part grouping task. In our framework, we design a gradient-field-based +selection graph neural network (GNN) to learn the gradients of a log +conditional probability density in terms of part selection, where the condition +is the given mixed part set. This innovative approach, implemented through the +gradient-field-based selection GNN, effectively captures complex relationships +among all the parts in the input. Upon completion of the training process, our +framework becomes capable of autonomously grouping 3D parts by iteratively +selecting them from the mixed part set, leveraging the knowledge acquired by +the trained gradient-field-based selection GNN. 
Our code is available at: +https://github.com/J-F-Cheng/G-FARS-3DPartGrouping.",cs.CV,['cs.CV'] +Unsupervised Salient Instance Detection,Xin Tian · Ke Xu · Rynson W.H. Lau, ,https://arxiv.org/abs/2404.14759,,2404.14759.pdf,Unified Unsupervised Salient Object Detection via Knowledge Transfer,"Recently, unsupervised salient object detection (USOD) has gained increasing +attention due to its annotation-free nature. However, current methods mainly +focus on specific tasks such as RGB and RGB-D, neglecting the potential for +task migration. In this paper, we propose a unified USOD framework for generic +USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based +Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a +pre-trained deep network. This mechanism starts with easy samples and +progressively moves towards harder ones, to avoid initial interference caused +by hard samples. Afterwards, the obtained saliency cues are utilized to train a +saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) +mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning +method is devised to transfer the acquired saliency knowledge, leveraging +shared knowledge to attain superior transferring performance on the target +tasks. Extensive experiments on five representative SOD tasks confirm the +effectiveness and feasibility of our proposed method. Code and supplement +materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.",cs.CV,['cs.CV'] +TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion,Yu-Ying Yeh · Jia-Bin Huang · Changil Kim · Lei Xiao · Thu Nguyen-Phuoc · Numair Khan · Cheng Zhang · Manmohan Chandraker · Carl Marshall · Zhao Dong · Zhengqin Li, ,,https://huggingface.co/papers/2401.09416,,,,,nan +RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models,Ozgur Kara · Bariscan Kurtkaya · Hidir Yesiltepe · James Rehg · Pinar Yanardag,https://rave-video.github.io/,https://arxiv.org/abs/2312.04524,,2312.04524.pdf,RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models,"Recent advancements in diffusion-based models have demonstrated significant +success in generating images from text. However, video editing models have not +yet reached the same level of visual quality and user control. To address this, +we introduce RAVE, a zero-shot video editing method that leverages pre-trained +text-to-image diffusion models without additional training. RAVE takes an input +video and a text prompt to produce high-quality videos while preserving the +original motion and semantic structure. It employs a novel noise shuffling +strategy, leveraging spatio-temporal interactions between frames, to produce +temporally consistent videos faster than existing methods. It is also efficient +in terms of memory requirements, allowing it to handle longer videos. RAVE is +capable of a wide range of edits, from local attribute modifications to shape +transformations. In order to demonstrate the versatility of RAVE, we create a +comprehensive video evaluation dataset ranging from object-focused scenes to +complex human activities like dancing and typing, and dynamic scenes featuring +swimming fish and boats. Our qualitative and quantitative experiments highlight +the effectiveness of RAVE in diverse video editing scenarios compared to +existing methods. 
Our code, dataset and videos can be found in +https://rave-video.github.io.",cs.CV,['cs.CV'] +CosmicMan: A Text-to-Image Foundation Model for Humans,Shikai Li · Jianglin Fu · Kaiyuan Liu · Wentao Wang · Kwan-Yee Lin · Wayne Wu, ,http://export.arxiv.org/abs/2404.01294,,2404.01294.pdf,CosmicMan: A Text-to-Image Foundation Model for Humans,"We present CosmicMan, a text-to-image foundation model specialized for +generating high-fidelity human images. Unlike current general-purpose +foundation models that are stuck in the dilemma of inferior quality and +text-image misalignment for humans, CosmicMan enables generating +photo-realistic human images with meticulous appearance, reasonable structure, +and precise text-image alignment with detailed dense descriptions. At the heart +of CosmicMan's success are the new reflections and perspectives on data and +models: (1) We found that data quality and a scalable data production flow are +essential for the final results from trained models. Hence, we propose a new +data production paradigm, Annotate Anyone, which serves as a perpetual data +flywheel to produce high-quality data with accurate yet cost-effective +annotations over time. Based on this, we constructed a large-scale dataset, +CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean +resolution of 1488x1255, and attached with precise text annotations deriving +from 115 Million attributes in diverse granularities. (2) We argue that a +text-to-image foundation model specialized for humans must be pragmatic -- easy +to integrate into down-streaming tasks while effective in producing +high-quality human images. Hence, we propose to model the relationship between +dense text descriptions and image pixels in a decomposed manner, and present +Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly +decomposes the cross-attention features in existing text-to-image diffusion +model, and enforces attention refocusing without adding extra modules. Through +Daring, we show that explicitly discretizing continuous text space into several +basic groups that align with human body structure is the key to tackling the +misalignment problem in a breeze.",cs.CV,['cs.CV'] +Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts,Jiayi Chen · Benteng Ma · Hengfei Cui · Kwang-Ting Cheng · Yong Xia, ,https://arxiv.org/abs/2312.02567,,2312.02567.pdf,Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts,"Federated learning facilitates the collaborative learning of a global model +across multiple distributed medical institutions without centralizing data. +Nevertheless, the expensive cost of annotation on local clients remains an +obstacle to effectively utilizing local data. To mitigate this issue, federated +active learning methods suggest leveraging local and global model predictions +to select a relatively small amount of informative local data for annotation. +However, existing methods mainly focus on all local data sampled from the same +domain, making them unreliable in realistic medical scenarios with domain +shifts among different clients. In this paper, we make the first attempt to +assess the informativeness of local data derived from diverse domains and +propose a novel methodology termed Federated Evidential Active Learning (FEAL) +to calibrate the data evaluation under domain shift. 
Specifically, we introduce +a Dirichlet prior distribution in both local and global models to treat the +prediction as a distribution over the probability simplex and capture both +aleatoric and epistemic uncertainties by using the Dirichlet-based evidential +model. Then we employ the epistemic uncertainty to calibrate the aleatoric +uncertainty. Afterward, we design a diversity relaxation strategy to reduce +data redundancy and maintain data diversity. Extensive experiments and analysis +on five real multi-center medical image datasets demonstrate the superiority of +FEAL over the state-of-the-art active learning methods in federated scenarios +with domain shifts. The code will be available at +https://github.com/JiayiChen815/FEAL.",cs.CV,['cs.CV'] +Riemannian Multinomial Logistics Regression for SPD Neural Networks,Ziheng Chen · Yue Song · Gaowen Liu · Ramana Kompella · Xiaojun Wu · Nicu Sebe,https://github.com/GitZH-Chen/SPDMLR.git,,https://openreview.net/forum?id=S0DUtGgkTM,,,,,nan +Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval,Young Kyun Jang · Donghyun Kim · Zihang Meng · Dat Huynh · Ser-Nam Lim,https://youngkyunjang.github.io/VDG_project/,https://arxiv.org/abs/2404.15516,,2404.15516.pdf,Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval,"Composed Image Retrieval (CIR) is a task that retrieves images similar to a +query, based on a provided textual modification. Current techniques rely on +supervised learning for CIR models using labeled triplets of the reference +image, text, target image. These specific triplets are not as commonly +available as simple image-text pairs, limiting the widespread use of CIR and +its scalability. On the other hand, zero-shot CIR can be relatively easily +trained with image-caption pairs without considering the image-to-image +relation, but this approach tends to yield lower accuracy. We propose a new +semi-supervised CIR approach where we search for a reference and its related +target images in auxiliary data and learn our large language model-based Visual +Delta Generator (VDG) to generate text describing the visual difference (i.e., +visual delta) between the two. VDG, equipped with fluent language knowledge and +being model agnostic, can generate pseudo triplets to boost the performance of +CIR models. Our approach significantly improves the existing supervised +learning approaches and achieves state-of-the-art results on the CIR +benchmarks.",cs.CV,"['cs.CV', 'cs.AI']" +LiDAR-based Person Re-identification,Wenxuan Guo · Zhiyu Pan · Yingping Liang · Ziheng Xi · Zhi Chen Zhong · Jianjiang Feng · Jie Zhou,https://github.com/GWxuan/ReID3D,https://arxiv.org/abs/2312.03033,,2312.03033.pdf,LiDAR-based Person Re-identification,"Camera-based person re-identification (ReID) systems have been widely applied +in the field of public security. However, cameras often lack the perception of +3D morphological information of human and are susceptible to various +limitations, such as inadequate illumination, complex background, and personal +privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that +utilizes pre-training strategy to retrieve features of 3D body shape and +introduces Graph-based Complementary Enhancement Encoder for extracting +comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the +first LiDAR-based person ReID dataset, which is collected in several outdoor +scenes with variations in natural conditions. 
Additionally, we introduce +LReID-sync, a simulated pedestrian dataset designed for pre-training encoders +with tasks of point cloud completion and shape parameter learning. Extensive +experiments on LReID show that ReID3D achieves exceptional performance with a +rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in +addressing person ReID tasks. To the best of our knowledge, we are the first to +propose a solution for LiDAR-based ReID. The code and datasets will be released +soon.",cs.CV,['cs.CV'] +Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition,Zihan Wang · Siyang Song · Cheng Luo · Songhe Deng · Weicheng Xie · Linlin Shen,https://github.com/CVI-SZU/MDHR,https://arxiv.org/abs/2404.06443,,2404.06443.pdf,Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition,"Human facial action units (AUs) are mutually related in a hierarchical +manner, as not only they are associated with each other in both spatial and +temporal domains but also AUs located in the same/close facial regions show +stronger relationships than those of different facial regions. While none of +existing approach thoroughly model such hierarchical inter-dependencies among +AUs, this paper proposes to comprehensively model multi-scale AU-related +dynamic and hierarchical spatio-temporal relationship among AUs for their +occurrences recognition. Specifically, we first propose a novel multi-scale +temporal differencing network with an adaptive weighting block to explicitly +capture facial dynamics across frames at different spatial scales, which +specifically considers the heterogeneity of range and magnitude in different +AUs' activation. Then, a two-stage strategy is introduced to hierarchically +model the relationship among AUs based on their spatial distribution (i.e., +local and cross-region AU relationship modelling). Experimental results +achieved on BP4D and DISFA show that our approach is the new state-of-the-art +in the field of AU occurrence recognition. Our code is publicly available at +https://github.com/CVI-SZU/MDHR.",cs.CV,['cs.CV'] +Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration,Shihao Zhou · Duosheng Chen · Jinshan Pan · Jinglei Shi · Jufeng Yang,https://github.com/joshyZhou/AST,https://arxiv.org/abs/2312.06874,,2312.06874.pdf,Dozerformer: Sequence Adaptive Sparse Transformer for Multivariate Time Series Forecasting,"Transformers have achieved remarkable performance in multivariate time +series(MTS) forecasting due to their capability to capture long-term +dependencies. However, the canonical attention mechanism has two key +limitations: (1) its quadratic time complexity limits the sequence length, and +(2) it generates future values from the entire historical sequence. To address +this, we propose a Dozer Attention mechanism consisting of three sparse +components: (1) Local, each query exclusively attends to keys within a +localized window of neighboring time steps. (2) Stride, enables each query to +attend to keys at predefined intervals. (3) Vary, allows queries to selectively +attend to keys from a subset of the historical sequence. Notably, the size of +this subset dynamically expands as forecasting horizons extend. Those three +components are designed to capture essential attributes of MTS data, including +locality, seasonality, and global temporal dependencies. 
Additionally, we +present the Dozerformer Framework, incorporating the Dozer Attention mechanism +for the MTS forecasting task. We evaluated the proposed Dozerformer framework +with recent state-of-the-art methods on nine benchmark datasets and confirmed +its superior performance. The code will be released after the manuscript is +accepted.",cs.LG,"['cs.LG', 'cs.CL']" +Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning,Hao Xiong · Yehui Tang · Xinyu Ye · Junchi Yan,https://github.com/ShawXh/qip_cvpr24,https://arxiv.org/abs/2310.03978,,2310.03978.pdf,Efficient Quantum Circuit Simulation by Tensor Network Methods on Modern GPUs,"Efficient simulation of quantum circuits has become indispensable with the +rapid development of quantum hardware. The primary simulation methods are based +on state vectors and tensor networks. As the number of qubits and quantum gates +grows larger in current quantum devices, traditional state-vector based quantum +circuit simulation methods prove inadequate due to the overwhelming size of the +Hilbert space and extensive entanglement. Consequently, brutal force tensor +network simulation algorithms become the only viable solution in such +scenarios. The two main challenges faced in tensor network simulation +algorithms are optimal contraction path finding and efficient execution on +modern computing devices, with the latter determines the actual efficiency. In +this study, we investigate the optimization of such tensor network simulations +on modern GPUs and propose general optimization strategies from two aspects: +computational efficiency and accuracy. Firstly, we propose to transform +critical Einstein summation operations into GEMM operations, leveraging the +specific features of tensor network simulations to amplify the efficiency of +GPUs. Secondly, by analyzing the data characteristics of quantum circuits, we +employ extended precision to ensure the accuracy of simulation results and +mixed precision to fully exploit the potential of GPUs, resulting in faster and +more precise simulations. Our numerical experiments demonstrate that our +approach can achieve a 3.96x reduction in verification time for random quantum +circuit samples in the 18-cycle case of Sycamore, with sustained performance +exceeding 21 TFLOPS on one A100. This method can be easily extended to the +20-cycle case, maintaining the same performance, accelerating by 12.5x compared +to the state-of-the-art CPU-based results and 4.48-6.78x compared to the +state-of-the-art GPU-based results reported in the literature.",quant-ph,"['quant-ph', 'cs.DC', 'physics.comp-ph']" +Image Sculpting: Precise Object Editing with 3D Geometry Control,Jiraphon Yenphraphai · Xichen Pan · Sainan Liu · Daniele Panozzo · Saining Xie,https://image-sculpting.github.io/,https://arxiv.org/abs/2401.01702,,2401.01702.pdf,Image Sculpting: Precise Object Editing with 3D Geometry Control,"We present Image Sculpting, a new framework for editing 2D images by +incorporating tools from 3D geometry and graphics. This approach differs +markedly from existing methods, which are confined to 2D spaces and typically +rely on textual instructions, leading to ambiguity and limited control. Image +Sculpting converts 2D objects into 3D, enabling direct interaction with their +3D geometry. 
Post-editing, these objects are re-rendered into 2D, merging into +the original image to produce high-fidelity results through a coarse-to-fine +enhancement process. The framework supports precise, quantifiable, and +physically-plausible editing options such as pose editing, rotation, +translation, 3D composition, carving, and serial addition. It marks an initial +step towards combining the creative freedom of generative models with the +precision of graphics pipelines.",cs.GR,"['cs.GR', 'cs.CV']" +Test-Time Domain Generalization for Face Anti-Spoofing,Qianyu Zhou · Ke-Yue Zhang · Taiping Yao · Xuequan Lu · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2403.19334,,2403.19334.pdf,Test-Time Domain Generalization for Face Anti-Spoofing,"Face Anti-Spoofing (FAS) is pivotal in safeguarding facial recognition +systems against presentation attacks. While domain generalization (DG) methods +have been developed to enhance FAS performance, they predominantly focus on +learning domain-invariant features during training, which may not guarantee +generalizability to unseen data that differs largely from the source +distributions. Our insight is that testing data can serve as a valuable +resource to enhance the generalizability beyond mere evaluation for DG FAS. In +this paper, we introduce a novel Test-Time Domain Generalization (TTDG) +framework for FAS, which leverages the testing data to boost the model's +generalizability. Our method, consisting of Test-Time Style Projection (TTSP) +and Diverse Style Shifts Simulation (DSSS), effectively projects the unseen +data to the seen domain space. In particular, we first introduce the innovative +TTSP to project the styles of the arbitrarily unseen samples of the testing +distribution to the known source space of the training distributions. We then +design the efficient DSSS to synthesize diverse style shifts via learnable +style bases with two specifically designed losses in a hyperspherical feature +space. Our method eliminates the need for model updates at the test time and +can be seamlessly integrated into not only the CNN but also ViT backbones. +Comprehensive experiments on widely used cross-domain FAS benchmarks +demonstrate our method's state-of-the-art performance and effectiveness.",cs.CV,['cs.CV'] +Towards Learning a Generalist Model for Embodied Navigation,Duo Zheng · Shijia Huang · Lin Zhao · Yiwu Zhong · Liwei Wang, ,https://arxiv.org/abs/2312.02010,,2312.02010.pdf,Towards Learning a Generalist Model for Embodied Navigation,"Building a generalist agent that can interact with the world is the +intriguing target of AI systems, thus spurring the research for embodied +navigation, where an agent is required to navigate according to instructions or +respond to queries. Despite the major progress attained, previous works +primarily focus on task-specific agents and lack generalizability to unseen +scenarios. Recently, LLMs have presented remarkable capabilities across various +fields, and provided a promising opportunity for embodied navigation. Drawing +on this, we propose the first generalist model for embodied navigation, +NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based +instruction. The schema-based instruction flexibly casts various tasks into +generation problems, thereby unifying a wide range of tasks. This approach +allows us to integrate diverse data sources from various datasets into the +training, equipping NaviLLM with a wide range of capabilities required by +embodied navigation. 
We conduct extensive experiments to evaluate the +performance and generalizability of our model. The experimental results +demonstrate that our unified model achieves state-of-the-art performance on +CVDN, SOON, and ScanQA. Specifically, it surpasses the previous +stats-of-the-art method by a significant margin of 29% in goal progress on +CVDN. Moreover, our model also demonstrates strong generalizability and +presents impressive results on unseen tasks, e.g., embodied question answering +and 3D captioning.",cs.CV,"['cs.CV', 'cs.AI']" +Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning,Tung Le · Khai Nguyen · Shanlin Sun · Nhat Ho · Xiaohui Xie, ,https://arxiv.org/abs/2403.01781v1,,2403.01781v1.pdf,Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning,"In the realm of computer vision and graphics, accurately establishing +correspondences between geometric 3D shapes is pivotal for applications like +object tracking, registration, texture transfer, and statistical shape +analysis. Moving beyond traditional hand-crafted and data-driven feature +learning methods, we incorporate spectral methods with deep learning, focusing +on functional maps (FMs) and optimal transport (OT). Traditional OT-based +approaches, often reliant on entropy regularization OT in learning-based +framework, face computational challenges due to their quadratic cost. Our key +contribution is to employ the sliced Wasserstein distance (SWD) for OT, which +is a valid fast optimal transport metric in an unsupervised shape matching +framework. This unsupervised framework integrates functional map regularizers +with a novel OT-based loss derived from SWD, enhancing feature alignment +between shapes treated as discrete probability measures. We also introduce an +adaptive refinement process utilizing entropy regularized OT, further refining +feature alignments for accurate point-to-point correspondences. Our method +demonstrates superior performance in non-rigid shape matching, including +near-isometric and non-isometric scenarios, and excels in downstream tasks like +segmentation transfer. The empirical results on diverse datasets highlight our +framework's effectiveness and generalization capabilities, setting new +standards in non-rigid shape matching with efficient OT metrics and an adaptive +refinement module.",cs.CV,"['cs.CV', 'cs.AI']" +NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images,Yufei Han · Heng Guo · Koki Fukai · Hiroaki Santo · Boxin Shi · Fumio Okura · Zhanyu Ma · Yunpeng Jia, ,,,,,,,nan +Towards Transferable Targeted 3D Adversarial Attack in the Physical World,Yao Huang · Yinpeng Dong · Shouwei Ruan · Xiao Yang · Hang Su · Xingxing Wei, ,https://arxiv.org/abs/2312.09558,,2312.09558.pdf,Towards Transferable Targeted 3D Adversarial Attack in the Physical World,"Compared with transferable untargeted attacks, transferable targeted +adversarial attacks could specify the misclassification categories of +adversarial samples, posing a greater threat to security-critical tasks. In the +meanwhile, 3D adversarial samples, due to their potential of multi-view +robustness, can more comprehensively identify weaknesses in existing deep +learning systems, possessing great application value. However, the field of +transferable targeted 3D adversarial attacks remains vacant. 
The goal of this +work is to develop a more effective technique that could generate transferable +targeted 3D adversarial examples, filling the gap in this field. To achieve +this goal, we design a novel framework named TT3D that could rapidly +reconstruct from few multi-view images into Transferable Targeted 3D textured +meshes. While existing mesh-based texture optimization methods compute +gradients in the high-dimensional mesh space and easily fall into local optima, +leading to unsatisfactory transferability and distinct distortions, TT3D +innovatively performs dual optimization towards both feature grid and +Multi-layer Perceptron (MLP) parameters in the grid-based NeRF space, which +significantly enhances black-box transferability while enjoying naturalness. +Experimental results show that TT3D not only exhibits superior cross-model +transferability but also maintains considerable adaptability across different +renders and vision tasks. More importantly, we produce 3D adversarial examples +with 3D printing techniques in the real world and verify their robust +performance under various scenarios.",cs.CV,['cs.CV'] +Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences,Axel Barroso-Laguna · Sowmya Munukutla · Victor Adrian Prisacariu · Eric Brachmann,https://nianticlabs.github.io/mickey/,https://arxiv.org/abs/2404.06337,,2404.06337.pdf,Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences,"Given two images, we can estimate the relative camera pose between them by +establishing image-to-image correspondences. Usually, correspondences are +2D-to-2D and the pose we estimate is defined only up to scale. Some +applications, aiming at instant augmented reality anywhere, require +scale-metric pose estimates, and hence, they rely on external depth estimators +to recover the scale. We present MicKey, a keypoint matching pipeline that is +able to predict metric correspondences in 3D camera space. By learning to match +3D coordinates across images, we are able to infer the metric relative pose +without depth measurements. Depth measurements are also not required for +training, nor are scene reconstructions or image overlap information. MicKey is +supervised only by pairs of images and their relative poses. MicKey achieves +state-of-the-art performance on the Map-Free Relocalisation benchmark while +requiring less supervision than competing approaches.",cs.CV,['cs.CV'] +3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation,Xingguang Zhong · Yue Pan · Cyrill Stachniss · Jens Behley,https://github.com/PRBonn/4dNDF,http://export.arxiv.org/abs/2405.03388,,2405.03388.pdf,3D LiDAR Mapping in Dynamic Environments Using a 4D Implicit Neural Representation,"Building accurate maps is a key building block to enable reliable +localization, planning, and navigation of autonomous vehicles. We propose a +novel approach for building accurate maps of dynamic environments utilizing a +sequence of LiDAR scans. To this end, we propose encoding the 4D scene into a +novel spatio-temporal implicit neural map representation by fitting a +time-dependent truncated signed distance function to each point. Using our +representation, we extract the static map by filtering the dynamic parts. Our +neural representation is based on sparse feature grids, a globally shared +decoder, and time-dependent basis functions, which we jointly optimize in an +unsupervised fashion. 
To learn this representation from a sequence of LiDAR +scans, we design a simple yet efficient loss function to supervise the map +optimization in a piecewise way. We evaluate our approach on various scenes +containing moving objects in terms of the reconstruction quality of static maps +and the segmentation of dynamic point clouds. The experimental results +demonstrate that our method is capable of removing the dynamic part of the +input point clouds while reconstructing accurate and complete 3D maps, +outperforming several state-of-the-art methods. Codes are available at: +https://github.com/PRBonn/4dNDF",cs.CV,"['cs.CV', 'cs.RO']" +Robust Emotion Recognition in Context Debiasing,Dingkang Yang · Kun Yang · Mingcheng Li · Shunli Wang · Shuaibing Wang · Lihua Zhang, ,https://arxiv.org/abs/2403.05963,,2403.05963.pdf,Robust Emotion Recognition in Context Debiasing,"Context-aware emotion recognition (CAER) has recently boosted the practical +applications of affective computing techniques in unconstrained environments. +Mainstream CAER methods invariably extract ensemble representations from +diverse contexts and subject-centred characteristics to perceive the target +person's emotional state. Despite advancements, the biggest challenge remains +due to context bias interference. The harmful bias forces the models to rely on +spurious correlations between background contexts and emotion labels in +likelihood estimation, causing severe performance bottlenecks and confounding +valuable context priors. In this paper, we propose a counterfactual emotion +inference (CLEF) framework to address the above issue. Specifically, we first +formulate a generalized causal graph to decouple the causal relationships among +the variables in CAER. Following the causal graph, CLEF introduces a +non-invasive context branch to capture the adverse direct effect caused by the +context bias. During the inference, we eliminate the direct context effect from +the total causal effect by comparing factual and counterfactual outcomes, +resulting in bias mitigation and robust prediction. As a model-agnostic +framework, CLEF can be readily integrated into existing methods, bringing +consistent performance gains.",cs.CV,"['cs.CV', 'cs.LG']" +Learning to Produce Semi-dense Correspondences for Visual Localization,Khang Truong Giang · Soohwan Song · Sungho Jo,https://github.com/TruongKhang/DeViLoc,https://arxiv.org/abs/2402.08359,,2402.08359.pdf,Learning to Produce Semi-dense Correspondences for Visual Localization,"This study addresses the challenge of performing visual localization in +demanding conditions such as night-time scenarios, adverse weather, and +seasonal changes. While many prior studies have focused on improving +image-matching performance to facilitate reliable dense keypoint matching +between images, existing methods often heavily rely on predefined feature +points on a reconstructed 3D model. Consequently, they tend to overlook +unobserved keypoints during the matching process. Therefore, dense keypoint +matches are not fully exploited, leading to a notable reduction in accuracy, +particularly in noisy scenes. To tackle this issue, we propose a novel +localization method that extracts reliable semi-dense 2D-3D matching points +based on dense keypoint matches. This approach involves regressing semi-dense +2D keypoints into 3D scene coordinates using a point inference network. The +network utilizes both geometric and visual cues to effectively infer 3D +coordinates for unobserved keypoints from the observed ones. 
The abundance of +matching information significantly enhances the accuracy of camera pose +estimation, even in scenarios involving noisy or sparse 3D models. +Comprehensive evaluations demonstrate that the proposed method outperforms +other methods in challenging scenes and achieves competitive results in +large-scale visual localization benchmarks. The code will be available.",cs.CV,['cs.CV'] +Distilling CLIP with Dual Guidance for Learning Discriminative Human Body Shape Representation,Feng Liu · Minchul Kim · Zhiyuan Ren · Xiaoming Liu, ,https://arxiv.org/abs/2307.12732,,,CLIP-KD: An Empirical Study of CLIP Model Distillation,"Contrastive Language-Image Pre-training (CLIP) has become a promising +language-supervised visual pre-training framework. This paper aims to distill +small CLIP models supervised by a large teacher CLIP model. We propose several +distillation strategies, including relation, feature, gradient and contrastive +paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We +show that a simple feature mimicry with Mean Squared Error loss works +surprisingly well. Moreover, interactive contrastive learning across teacher +and student encoders is also effective in performance improvement. We explain +that the success of CLIP-KD can be attributed to maximizing the feature +similarity between teacher and student. The unified method is applied to +distill several student models trained on CC3M+12M. CLIP-KD improves student +CLIP models consistently over zero-shot ImageNet classification and cross-modal +retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the +teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy +over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\% +and 20.1\% margins, respectively. Our code is released on +https://github.com/winycg/CLIP-KD.",cs.CV,['cs.CV'] +From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation,Yiwei Bao · Feng Lu, ,https://arxiv.org/abs/2309.02165,,2309.02165.pdf,PCFGaze: Physics-Consistent Feature for Appearance-based Gaze Estimation,"Although recent deep learning based gaze estimation approaches have achieved +much improvement, we still know little about how gaze features are connected to +the physics of gaze. In this paper, we try to answer this question by analyzing +the gaze feature manifold. Our analysis revealed the insight that the geodesic +distance between gaze features is consistent with the gaze differences between +samples. According to this finding, we construct the Physics- Consistent +Feature (PCF) in an analytical way, which connects gaze feature to the physical +definition of gaze. We further propose the PCFGaze framework that directly +optimizes gaze feature space by the guidance of PCF. Experimental results +demonstrate that the proposed framework alleviates the overfitting problem and +significantly improves cross-domain gaze estimation accuracy without extra +training data. 
The insight of gaze feature has the potential to benefit other +regression tasks with physical meanings.",cs.CV,['cs.CV'] +Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception,Haoming Chen · Zhizhong Zhang · Yanyun Qu · Ruixin Zhang · Xin Tan · Yuan Xie, ,https://arxiv.org/abs/2405.07201,,2405.07201.pdf,Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception,"An effective pre-training framework with universal 3D representations is +extremely desired in perceiving large-scale dynamic scenes. However, +establishing such an ideal framework that is both task-generic and +label-efficient poses a challenge in unifying the representation of the same +primitive across diverse scenes. The current contrastive 3D pre-training +methods typically follow a frame-level consistency, which focuses on the 2D-3D +relationships in each detached image. Such inconsiderate consistency greatly +hampers the promising path of reaching an universal pre-training framework: (1) +The cross-scene semantic self-conflict, i.e., the intense collision between +primitive segments of the same semantics from different scenes; (2) Lacking a +globally unified bond that pushes the cross-scene semantic consistency into 3D +representation learning. To address above challenges, we propose a CSC +framework that puts a scene-level semantic consistency in the heart, bridging +the connection of the similar semantic segments across various scenes. To +achieve this goal, we combine the coherent semantic cues provided by the vision +foundation model and the knowledge-rich cross-scene prototypes derived from the +complementary multi-modality information. These allow us to train a universal +3D pre-training model that facilitates various downstream tasks with less +fine-tuning efforts. Empirically, we achieve consistent improvements over SOTA +pre-training approaches in semantic segmentation (+1.4% mIoU), object detection +(+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D +network on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC, +hoping to inspire future research.",cs.CV,['cs.CV'] +Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation,Dongliang Cao · Marvin Eisenberger · Nafie El Amrani · Daniel Cremers · Florian Bernard, ,https://web3.arxiv.org/abs/2402.18920,,2402.18920.pdf,Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation,"Although 3D shape matching and interpolation are highly interrelated, they +are often studied separately and applied sequentially to relate different 3D +shapes, thus resulting in sub-optimal performance. In this work we present a +unified framework to predict both point-wise correspondences and shape +interpolation between 3D shapes. To this end, we combine the deep functional +map framework with classical surface deformation models to map shapes in both +spectral and spatial domains. On the one hand, by incorporating spatial maps, +our method obtains more accurate and smooth point-wise correspondences compared +to previous functional map methods for shape matching. On the other hand, by +introducing spectral maps, our method gets rid of commonly used but +computationally expensive geodesic distance constraints that are only valid for +near-isometric shape deformations. Furthermore, we propose a novel test-time +adaptation scheme to capture both pose-dominant and shape-dominant +deformations. 
Using different challenging datasets, we demonstrate that our +method outperforms previous state-of-the-art methods for both shape matching +and interpolation, even compared to supervised approaches.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CG']" +SinSR: Diffusion-Based Image Super-Resolution in a Single Step,Yufei Wang · Wenhan Yang · Xinyuan Chen · Yaohui Wang · Lanqing Guo · Lap-Pui Chau · Ziwei Liu · Yu Qiao · Alex C. Kot · Bihan Wen, ,https://arxiv.org/abs/2311.14760,,2311.14760.pdf,SinSR: Diffusion-Based Image Super-Resolution in a Single Step,"While super-resolution (SR) methods based on diffusion models exhibit +promising results, their practical application is hindered by the substantial +number of required inference steps. Recent methods utilize degraded images in +the initial state, thereby shortening the Markov chain. Nevertheless, these +solutions either rely on a precise formulation of the degradation process or +still necessitate a relatively lengthy generation path (e.g., 15 iterations). +To enhance inference speed, we propose a simple yet effective method for +achieving single-step SR generation, named SinSR. Specifically, we first derive +a deterministic sampling process from the most recent state-of-the-art (SOTA) +method for accelerating diffusion-based SR. This allows the mapping between the +input random noise and the generated high-resolution image to be obtained in a +reduced and acceptable number of inference steps during training. We show that +this deterministic mapping can be distilled into a student model that performs +SR within only one inference step. Additionally, we propose a novel +consistency-preserving loss to simultaneously leverage the ground-truth image +during the distillation process, ensuring that the performance of the student +model is not solely bound by the feature manifold of the teacher model, +resulting in further performance improvement. Extensive experiments conducted +on synthetic and real-world datasets demonstrate that the proposed method can +achieve comparable or even superior performance compared to both previous SOTA +methods and the teacher model, in just one sampling step, resulting in a +remarkable up to x10 speedup for inference. Our code will be released at +https://github.com/wyf0912/SinSR",cs.CV,['cs.CV'] +SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling,Juhee Lee · Jewon Kang, ,https://arxiv.org/abs/2402.03161,,2402.03161.pdf,Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization,"In light of recent advances in multimodal Large Language Models (LLMs), there +is increasing attention to scaling them from image-text data to more +informative real-world videos. Compared to static images, video poses unique +challenges for effective large-scale pre-training due to the modeling of its +spatiotemporal dynamics. In this paper, we address such limitations in +video-language pre-training with an efficient video decomposition that +represents each video as keyframes and temporal motions. These are then adapted +to an LLM using well-designed tokenizers that discretize visual and temporal +information as a few tokens, thus enabling unified generative pre-training of +videos, images, and text. At inference, the generated tokens from the LLM are +carefully recovered to the original continuous pixel space to create various +video content. 
Our proposed framework is both capable of comprehending and +generating image and video content, as demonstrated by its competitive +performance across 13 multimodal benchmarks in image and video understanding +and generation. Our code and models are available at +https://video-lavit.github.io.",cs.CV,"['cs.CV', 'cs.CL']" +Quantifying Task Priority for Multi-Task Optimization,Wooseong Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2403.16162,,2403.16162.pdf,Multi-Task Learning with Multi-Task Optimization,"Multi-task learning solves multiple correlated tasks. However, conflicts may +exist between them. In such circumstances, a single solution can rarely +optimize all the tasks, leading to performance trade-offs. To arrive at a set +of optimized yet well-distributed models that collectively embody different +trade-offs in one algorithmic pass, this paper proposes to view Pareto +multi-task learning through the lens of multi-task optimization. Multi-task +learning is first cast as a multi-objective optimization problem, which is then +decomposed into a diverse set of unconstrained scalar-valued subproblems. These +subproblems are solved jointly using a novel multi-task gradient descent +method, whose uniqueness lies in the iterative transfer of model parameters +among the subproblems during the course of optimization. A theorem proving +faster convergence through the inclusion of such transfers is presented. We +investigate the proposed multi-task learning with multi-task optimization for +solving various problem settings including image classification, scene +understanding, and multi-target regression. Comprehensive experiments confirm +that the proposed method significantly advances the state-of-the-art in +discovering sets of Pareto-optimized models. Notably, on the large image +dataset we tested on, namely NYUv2, the hypervolume convergence achieved by our +method was found to be nearly two times faster than the next-best among the +state-of-the-art.",cs.AI,['cs.AI'] +Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding,Le Zhang · Rabiul Awal · Aishwarya Agrawal,https://github.com/lezhang7/Enhance-FineGrained,https://arxiv.org/abs/2306.08832,,2306.08832.pdf,Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding,"Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text +comprehension abilities, facilitating advances in several downstream tasks such +as zero-shot image classification, image-text retrieval, and text-to-image +generation. However, the compositional reasoning abilities of existing VLMs +remains subpar. The root of this limitation lies in the inadequate alignment +between the images and captions in the pretraining datasets. Additionally, the +current contrastive learning objective fails to focus on fine-grained grounding +components like relations, actions, and attributes, resulting in ""bag-of-words"" +representations. We introduce a simple and effective method to improve +compositional reasoning in VLMs. Our method better leverages available datasets +by refining and expanding the standard image-text contrastive learning +framework. Our approach does not require specific annotations and does not +incur extra parameters. When integrated with CLIP, our technique yields notable +improvement over state-of-the-art baselines across five vision-language +compositional benchmarks. 
We open-source our code at +https://github.com/lezhang7/Enhance-FineGrained.",cs.CV,['cs.CV'] +CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition,Feng Lu · Xiangyuan Lan · Lijun Zhang · Dongmei Jiang · Yaowei Wang · Chun Yuan, ,https://arxiv.org/abs/2402.19231,,2402.19231.pdf,CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition,"Over the past decade, most methods in visual place recognition (VPR) have +used neural networks to produce feature representations. These networks +typically produce a global representation of a place image using only this +image itself and neglect the cross-image variations (e.g. viewpoint and +illumination), which limits their robustness in challenging scenes. In this +paper, we propose a robust global representation method with cross-image +correlation awareness for VPR, named CricaVPR. Our method uses the attention +mechanism to correlate multiple images within a batch. These images can be +taken in the same place with different conditions or viewpoints, or even +captured from different places. Therefore, our method can utilize the +cross-image variations as a cue to guide the representation learning, which +ensures more robust features are produced. To further facilitate the +robustness, we propose a multi-scale convolution-enhanced adaptation method to +adapt pre-trained visual foundation models to the VPR task, which introduces +the multi-scale local information to further enhance the cross-image +correlation-aware representation. Experimental results show that our method +outperforms state-of-the-art methods by a large margin with significantly less +training time. The code is released at https://github.com/Lu-Feng/CricaVPR.",cs.CV,"['cs.CV', 'cs.RO']" +Dual Prior Unfolding for Snapshot Compressive Imaging,Jiancheng Zhang · Haijin Zeng · Jiezhang Cao · Yongyong Chen · Dengxiu Yu · Yinping Zhao,https://github.com/ZhangJC-2k/DPU,,https://link.springer.com/article/10.1007/s11263-023-01844-4,,,,,nan +Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships,Rangel Daroya · Aaron Sun · Subhransu Maji,https://github.com/cvl-umass/task2box,https://arxiv.org/abs/2403.17173,,2403.17173.pdf,Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships,"Modeling and visualizing relationships between tasks or datasets is an +important step towards solving various meta-tasks such as dataset discovery, +multi-tasking, and transfer learning. However, many relationships, such as +containment and transferability, are naturally asymmetric and current +approaches for representation and visualization (e.g., t-SNE) do not readily +support this. We propose Task2Box, an approach to represent tasks using box +embeddings -- axis-aligned hyperrectangles in low dimensional spaces -- that +can capture asymmetric relationships between them through volumetric overlaps. +We show that Task2Box accurately predicts unseen hierarchical relationships +between nodes in ImageNet and iNaturalist datasets, as well as transferability +between tasks in the Taskonomy benchmark. We also show that box embeddings +estimated from task representations (e.g., CLIP, Task2Vec, or attribute based) +can be used to predict relationships between unseen tasks more accurately than +classifiers trained on the same representations, as well as handcrafted +asymmetric distances (e.g., KL divergence). 
This suggests that low-dimensional +box embeddings can effectively capture these task relationships and have the +added advantage of being interpretable. We use the approach to visualize +relationships among publicly available image classification datasets on popular +dataset hosting platform called Hugging Face.",cs.CV,['cs.CV'] +Shadow Generation for Composite Image Using Diffusion Model,Qingyang Liu · Junqi You · Jian-Ting Wang · Xinhao Tao · Bo Zhang · Li Niu, ,https://arxiv.org/abs/2403.15234,,2403.15234.pdf,Shadow Generation for Composite Image Using Diffusion model,"In the realm of image composition, generating realistic shadow for the +inserted foreground remains a formidable challenge. Previous works have +developed image-to-image translation models which are trained on paired +training data. However, they are struggling to generate shadows with accurate +shapes and intensities, hindered by data scarcity and inherent task complexity. +In this paper, we resort to foundation model with rich prior knowledge of +natural shadow images. Specifically, we first adapt ControlNet to our task and +then propose intensity modulation modules to improve the shadow intensity. +Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel +data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 +datasets as well as real composite images demonstrate the superior capability +of our model for shadow generation task. The dataset, code, and model are +released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.",cs.CV,['cs.CV'] +NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild,Weining Ren · Zihan Zhu · Boyang Sun · Jiaqi Chen · Marc Pollefeys · Songyou Peng,https://rwn17.github.io/nerf-on-the-go,https://arxiv.org/abs/2405.18715,,2405.18715.pdf,NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild,"Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing +photorealistic views from multi-view images of static scenes, but face +challenges in dynamic, real-world environments with distractors like moving +objects, shadows, and lighting changes. Existing methods manage controlled +environments and low occlusion ratios but fall short in render quality, +especially under high occlusion scenarios. In this paper, we introduce NeRF +On-the-go, a simple yet effective approach that enables the robust synthesis of +novel views in complex, in-the-wild scenes from only casually captured image +sequences. Delving into uncertainty, our method not only efficiently eliminates +distractors, even when they are predominant in captures, but also achieves a +notably faster convergence speed. Through comprehensive experiments on various +scenes, our method demonstrates a significant improvement over state-of-the-art +techniques. This advancement opens new avenues for NeRF in diverse and dynamic +real-world applications.",cs.CV,['cs.CV'] +Improved Baselines with Visual Instruction Tuning,Haotian Liu · Chunyuan Li · Yuheng Li · Yong Jae Lee,https://llava-vl.github.io,https://arxiv.org/abs/2310.03744,,2310.03744.pdf,Improved Baselines with Visual Instruction Tuning,"Large multimodal models (LMM) have recently shown encouraging progress with +visual instruction tuning. In this note, we show that the fully-connected +vision-language cross-modal connector in LLaVA is surprisingly powerful and +data-efficient. 
With simple modifications to LLaVA, namely, using +CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA +data with simple response formatting prompts, we establish stronger baselines +that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint +uses merely 1.2M publicly available data, and finishes full training in ~1 day +on a single 8-A100 node. We hope this can make state-of-the-art LMM research +more accessible. Code and model will be publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +ParamISP: Learned Forward and Inverse ISPs using Camera Parameters,Woohyeok Kim · Geonu Kim · Junyong Lee · Seungyong Lee · Seung-Hwan Baek · Sunghyun Cho,https://woo525.github.io/ParamISP/,https://arxiv.org/abs/2312.13313,,2312.13313.pdf,ParamISP: Learned Forward and Inverse ISPs using Camera Parameters,"RAW images are rarely shared mainly due to its excessive data size compared +to their sRGB counterparts obtained by camera ISPs. Learning the forward and +inverse processes of camera ISPs has been recently demonstrated, enabling +physically-meaningful RAW-level image processing on input sRGB images. However, +existing learning-based ISP methods fail to handle the large variations in the +ISP processes with respect to camera parameters such as ISO and exposure time, +and have limitations when used for various applications. In this paper, we +propose ParamISP, a learning-based method for forward and inverse conversion +between sRGB and RAW images, that adopts a novel neural-network module to +utilize camera parameters, which is dubbed as ParamNet. Given the camera +parameters provided in the EXIF data, ParamNet converts them into a feature +vector to control the ISP networks. Extensive experiments demonstrate that +ParamISP achieve superior RAW and sRGB reconstruction results compared to +previous methods and it can be effectively used for a variety of applications +such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and +camera-to-camera transfer.",eess.IV,"['eess.IV', 'cs.CV']" +ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts,Mu Cai · Haotian Liu · Siva Mustikovela · Gregory P. Meyer · Yuning Chai · Dennis Park · Yong Jae Lee,https://vip-llava.github.io/,https://arxiv.org/abs/2312.00784,,2312.00784.pdf,ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts,"While existing large vision-language multimodal models focus on whole image +understanding, there is a prominent gap in achieving region-specific +comprehension. Current approaches that use textual coordinates or spatial +encodings often fail to provide a user-friendly interface for visual prompting. +To address this challenge, we introduce a novel multimodal model capable of +decoding arbitrary visual prompts. This allows users to intuitively mark images +and interact with the model using natural cues like a ""red bounding box"" or +""pointed arrow"". Our simple design directly overlays visual markers onto the +RGB image, eliminating the need for complex region encodings, yet achieves +state-of-the-art performance on region-understanding tasks like Visual7W, +PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present +ViP-Bench, a comprehensive benchmark to assess the capability of models in +understanding visual prompts across multiple dimensions, enabling future +research in this domain. 
Code, data, and model are publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Compact 3D Gaussian Representation for Radiance Field,Joo Chan Lee · Daniel Rho · Xiangyu Sun · Jong Hwan Ko · Eunbyung Park,https://maincold2.github.io/c3dgs/,https://arxiv.org/abs/2311.13681,,2311.13681.pdf,Compact 3D Gaussian Representation for Radiance Field,"Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in +capturing complex 3D scenes with high fidelity. However, one persistent +challenge that hinders the widespread adoption of NeRFs is the computational +bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian +splatting (3DGS) has recently emerged as an alternative representation that +leverages a 3D Gaussisan-based representation and adopts the rasterization +pipeline to render the images rather than volumetric rendering, achieving very +fast rendering speed and promising image quality. However, a significant +drawback arises as 3DGS entails a substantial number of 3D Gaussians to +maintain the high fidelity of the rendered images, which requires a large +amount of memory and storage. To address this critical issue, we place a +specific emphasis on two key objectives: reducing the number of Gaussian points +without sacrificing performance and compressing the Gaussian attributes, such +as view-dependent color and covariance. To this end, we propose a learnable +mask strategy that significantly reduces the number of Gaussians while +preserving high performance. In addition, we propose a compact but effective +representation of view-dependent color by employing a grid-based neural field +rather than relying on spherical harmonics. Finally, we learn codebooks to +compactly represent the geometric attributes of Gaussian by vector +quantization. With model compression techniques such as quantization and +entropy coding, we consistently show over 25$\times$ reduced storage and +enhanced rendering speed, while maintaining the quality of the scene +representation, compared to 3DGS. Our work provides a comprehensive framework +for 3D scene representation, achieving high performance, fast training, +compactness, and real-time rendering. Our project page is available at +https://maincold2.github.io/c3dgs/.",cs.CV,"['cs.CV', 'cs.GR']" +Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology,Oren Kraus · Kian Kenyon-Dean · Saber Saberian · Maryam Fallah · Peter McLean · Jess Leung · Vasudev Sharma · Ayla Khan · Jia Balakrishnan · Safiye Celik · Dominique Beaini · Maciej Sypetkowski · Chi Cheng · Kristen Morse · Maureen Makes · Ben Mabey · Berton Earnshaw, ,https://arxiv.org/abs/2404.10242,,2404.10242.pdf,Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology,"Featurizing microscopy images for use in biological research remains a +significant challenge, especially for large-scale experiments spanning millions +of images. This work explores the scaling properties of weakly supervised +classifiers and self-supervised masked autoencoders (MAEs) when training with +increasingly larger model backbones and microscopy datasets. Our results show +that ViT-based MAEs outperform weakly supervised classifiers on a variety of +tasks, achieving as much as a 11.5% relative improvement when recalling known +biological relationships curated from public databases. Additionally, we +develop a new channel-agnostic MAE architecture (CA-MAE) that allows for +inputting images of different numbers and orders of channels at inference time. 
+We demonstrate that CA-MAEs effectively generalize by inferring and evaluating +on a microscopy image dataset (JUMP-CP) generated under different experimental +conditions with a different channel structure than our pretraining data +(RPI-93M). Our findings motivate continued research into scaling +self-supervised learning on microscopy data in order to create powerful +foundation models of cellular biology that have the potential to catalyze +advancements in drug discovery and beyond.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis,Yanzuo Lu · Manlin Zhang · Jinhua Ma · Xiaohua Xie · Jianhuang Lai,https://github.com/YanzuoLu/CFLD,https://arxiv.org/abs/2402.18078,,2402.18078.pdf,Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis,"Diffusion model is a promising approach to image generation and has been +employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive +performance. While existing methods simply align the person appearance to the +target pose, they are prone to overfitting due to the lack of a high-level +semantic understanding on the source person image. In this paper, we propose a +novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence +of image-caption pairs and textual prompts, we develop a novel training +paradigm purely based on images to control the generation process of a +pre-trained text-to-image diffusion model. A perception-refined decoder is +designed to progressively refine a set of learnable queries and extract +semantic understanding of person images as a coarse-grained prompt. This allows +for the decoupling of fine-grained appearance and pose information controls at +different stages, and thus circumventing the potential overfitting problem. To +generate more realistic texture details, a hybrid-granularity attention module +is proposed to encode multi-scale fine-grained appearance features as bias +terms to augment the coarse-grained prompt. Both quantitative and qualitative +experimental results on the DeepFashion benchmark demonstrate the superiority +of our method over the state of the arts for PGPIS. Code is available at +https://github.com/YanzuoLu/CFLD.",cs.CV,['cs.CV'] +SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training,WU Sitong · Haoru Tan · Zhuotao Tian · Yukang Chen · Xiaojuan Qi · Jiaya Jia, ,https://arxiv.org/abs/2405.10286,,,FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,"Despite noise and caption quality having been acknowledged as important +factors impacting vision-language contrastive pre-training, in this paper, we +show that the full potential of improving the training process by addressing +such issues is yet to be realized. Specifically, we firstly study and analyze +two issues affecting training: incorrect assignment of negative pairs, and low +caption quality and diversity. Then, we devise effective solutions for +addressing both problems, which essentially require training with multiple true +positive pairs. Finally, we propose training with sigmoid loss to address such +a requirement. 
We show very large gains over the current state-of-the-art for +both image recognition ($\sim +6\%$ on average over 11 datasets) and image +retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).",cs.CV,"['cs.CV', 'cs.AI']" +Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps,Octave Mariotti · Oisin Mac Aodha · Hakan Bilen, ,https://arxiv.org/abs/2312.13216,,2312.13216.pdf,Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps,"Recent progress in self-supervised representation learning has resulted in +models that are capable of extracting image features that are not only +effective at encoding image level, but also pixel-level, semantics. These +features have been shown to be effective for dense visual semantic +correspondence estimation, even outperforming fully-supervised methods. +Nevertheless, current self-supervised approaches still fail in the presence of +challenging image characteristics such as symmetries and repeated parts. To +address these limitations, we propose a new approach for semantic +correspondence estimation that supplements discriminative self-supervised +features with 3D understanding via a weak geometric spherical prior. Compared +to more involved 3D pipelines, our model only requires weak viewpoint +information, and the simplicity of our spherical representation enables us to +inject informative geometric priors into the model during training. We propose +a new evaluation metric that better accounts for repeated part and +symmetry-induced mistakes. We present results on the challenging SPair-71k +dataset, where we show that our approach demonstrates is capable of +distinguishing between symmetric views and repeated parts across many object +categories, and also demonstrate that we can generalize to unseen classes on +the AwA dataset.",cs.CV,['cs.CV'] +XFeat: Accelerated Features for Lightweight Image Matching,Guilherme Potje · Felipe Cadar · André Araujo · Renato Martins · Erickson R. Nascimento,https://verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24,https://arxiv.org/abs/2404.19174,,2404.19174.pdf,XFeat: Accelerated Features for Lightweight Image Matching,"We introduce a lightweight and accurate architecture for resource-efficient +visual correspondence. Our method, dubbed XFeat (Accelerated Features), +revisits fundamental design choices in convolutional neural networks for +detecting, extracting, and matching local features. Our new model satisfies a +critical need for fast and robust algorithms suitable to resource-limited +devices. In particular, accurate image matching requires sufficiently large +image resolutions - for this reason, we keep the resolution as large as +possible while limiting the number of channels in the network. Besides, our +model is designed to offer the choice of matching at the sparse or semi-dense +levels, each of which may be more suitable for different downstream +applications, such as visual navigation and augmented reality. Our model is the +first to offer semi-dense matching efficiently, leveraging a novel match +refinement module that relies on coarse local descriptors. XFeat is versatile +and hardware-independent, surpassing current deep learning-based local features +in speed (up to 5x faster) with comparable or better accuracy, proven in pose +estimation and visual localization. We showcase it running in real-time on an +inexpensive laptop CPU without specialized hardware optimizations. 
Code and +weights are available at www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24.",cs.CV,['cs.CV'] +Towards Realistic Scene Generation with LiDAR Diffusion Models,Haoxi Ran · Vitor Guizilini · Yue Wang,https://lidar-diffusion.github.io/,https://arxiv.org/abs/2404.00815,,2404.00815.pdf,Towards Realistic Scene Generation with LiDAR Diffusion Models,"Diffusion models (DMs) excel in photo-realistic image synthesis, but their +adaptation to LiDAR scene generation poses a substantial hurdle. This is +primarily because DMs operating in the point space struggle to preserve the +curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of +their representation power. In this paper, we propose LiDAR Diffusion Models +(LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to +capture the realism of LiDAR scenes by incorporating geometric priors into the +learning pipeline. Our method targets three major desiderata: pattern realism, +geometry realism, and object realism. Specifically, we introduce curve-wise +compression to simulate real-world LiDAR patterns, point-wise coordinate +supervision to learn scene geometry, and patch-wise encoding for a full 3D +object context. With these three core designs, our method achieves competitive +performance on unconditional LiDAR generation in 64-beam scenario and state of +the art on conditional LiDAR generation, while maintaining high efficiency +compared to point-based DMs (up to 107$\times$ faster). Furthermore, by +compressing LiDAR scenes into a latent space, we enable the controllability of +DMs with various conditions such as semantic maps, camera views, and text +prompts.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models,Muyang Li · Tianle Cai · Jiaxin Cao · Qinsheng Zhang · Han Cai · Junjie Bai · Yangqing Jia · Kai Li · Song Han, ,https://arxiv.org/abs/2402.19481,,2402.19481.pdf,DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models,"Diffusion models have achieved great success in synthesizing high-quality +images. However, generating high-resolution images with diffusion models is +still challenging due to the enormous computational costs, resulting in a +prohibitive latency for interactive applications. In this paper, we propose +DistriFusion to tackle this problem by leveraging parallelism across multiple +GPUs. Our method splits the model input into multiple patches and assigns each +patch to a GPU. However, naively implementing such an algorithm breaks the +interaction between patches and loses fidelity, while incorporating such an +interaction will incur tremendous communication overhead. To overcome this +dilemma, we observe the high similarity between the input from adjacent +diffusion steps and propose displaced patch parallelism, which takes advantage +of the sequential nature of the diffusion process by reusing the pre-computed +feature maps from the previous timestep to provide context for the current +step. Therefore, our method supports asynchronous communication, which can be +pipelined by computation. Extensive experiments show that our method can be +applied to recent Stable Diffusion XL with no quality degradation and achieve +up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. 
Our code is +publicly available at https://github.com/mit-han-lab/distrifuser.",cs.CV,['cs.CV'] +Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use,Imad Eddine Toubal · Aditya Avinash · Neil Alldrin · Jan Dlabal · Wenlei Zhou · Enming Luo · Otilia Stretcu · Hao Xiong · Chun-Ta Lu · Howard Zhou · Ranjay Krishna · Ariel Fuxman · Tom Duerig, ,https://arxiv.org/abs/2403.02626,,2403.02626.pdf,Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use,"From content moderation to wildlife conservation, the number of applications +that require models to recognize nuanced or subjective visual concepts is +growing. Traditionally, developing classifiers for such concepts requires +substantial manual effort measured in hours, days, or even months to identify +and annotate data needed for training. Even with recently proposed Agile +Modeling techniques, which enable rapid bootstrapping of image classifiers, +users are still required to spend 30 minutes or more of monotonous, repetitive +data labeling just to train a single classifier. Drawing on Fiske's Cognitive +Miser theory, we propose a new framework that alleviates manual effort by +replacing human labeling with natural language interactions, reducing the total +effort required to define a concept by an order of magnitude: from labeling +2,000 images to only 100 plus some natural language interactions. Our framework +leverages recent advances in foundation models, both large language models and +vision-language models, to carve out the concept space through conversation and +by automatically labeling training data points. Most importantly, our framework +eliminates the need for crowd-sourced annotations. Moreover, our framework +ultimately produces lightweight classification models that are deployable in +cost-sensitive scenarios. Across 15 subjective concepts and across 2 public +image classification datasets, our trained models outperform traditional Agile +Modeling as well as state-of-the-art zero-shot classification models like +ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.",cs.CV,"['cs.CV', 'cs.LG']" +Multi-Task Dense Prediction via Mixture of Low-Rank Experts,Yuqi Yang · Peng-Tao Jiang · Qibin Hou · Hao Zhang · Jinwei Chen · Bo Li, ,https://arxiv.org/abs/2403.17749,,2403.17749.pdf,Multi-Task Dense Prediction via Mixture of Low-Rank Experts,"Previous multi-task dense prediction methods based on the Mixture of Experts +(MoE) have received great performance but they neglect the importance of +explicitly modeling the global relations among all tasks. In this paper, we +present a novel decoder-focused method for multi-task dense prediction, called +Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, +MLoRE adds a generic convolution path to the original MoE structure, where each +task feature can go through this path for explicit parameter sharing. +Furthermore, to control the parameters and computational cost brought by the +increase in the number of experts, we take inspiration from LoRA and propose to +leverage the low-rank format of a vanilla convolution in the expert network. +Since the low-rank experts have fewer parameters and can be dynamically +parameterized into the generic convolution, the parameters and computational +cost do not change much with the increase of experts. 
Benefiting from this +design, we increase the number of experts and its reception field to enlarge +the representation capacity, facilitating multiple dense tasks learning in a +unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 +benchmarks show that our MLoRE achieves superior performance compared to +previous state-of-the-art methods on all metrics. Our code is available at +https://github.com/YuqiYang213/MLoRE.",cs.CV,['cs.CV'] +Traffic Scene Parsing through the TSP6K Dataset,Peng-Tao Jiang · Yuqi Yang · Yang Cao · Qibin Hou · Ming-Ming Cheng · Chunhua Shen, ,https://ar5iv.labs.arxiv.org/html/2303.02835,,2303.02835.pdf,Traffic Scene Parsing through the TSP6K Dataset,"Traffic scene perception in computer vision is a critically important task to +achieve intelligent cities. To date, most existing datasets focus on autonomous +driving scenes. We observe that the models trained on those driving datasets +often yield unsatisfactory results on traffic monitoring scenes. However, +little effort has been put into improving the traffic monitoring scene +understanding, mainly due to the lack of specific datasets. To fill this gap, +we introduce a specialized traffic monitoring dataset, termed TSP6K, containing +images from the traffic monitoring scenario, with high-quality pixel-level and +instance-level annotations. The TSP6K dataset captures more crowded traffic +scenes with several times more traffic participants than the existing driving +scenes. We perform a detailed analysis of the dataset and comprehensively +evaluate previous popular scene parsing methods, instance segmentation methods +and unsupervised domain adaption methods. Furthermore, considering the vast +difference in instance sizes, we propose a detail refining decoder for scene +parsing, which recovers the details of different semantic regions in traffic +scenes owing to the proposed TSP6K dataset. Experiments show its effectiveness +in parsing the traffic monitoring scenes. Code and dataset are available at +https://github.com/PengtaoJiang/TSP6K.",cs.CV,['cs.CV'] +"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation",Chris Rockwell · Nilesh Kulkarni · Linyi Jin · Jeong Joon Park · Justin Johnson · David Fouhey, ,https://arxiv.org/abs/2403.03221,,,"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation","Estimating relative camera poses between images has been a central problem in +computer vision. Methods that find correspondences and solve for the +fundamental matrix offer high precision in most cases. Conversely, methods +predicting pose directly using neural networks are more robust to limited +overlap and can infer absolute translation scale, but at the expense of reduced +precision. We show how to combine the best of both methods; our approach yields +results that are both precise and robust, while also accurately inferring +translation scales. At the heart of our model lies a Transformer that (1) +learns to balance between solved and learned pose estimations, and (2) provides +a prior to guide a solver. 
A comprehensive analysis supports our design choices +and demonstrates that our method adapts flexibly to various feature extractors +and correspondence estimators, showing state-of-the-art performance in 6DoF +pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free +Relocalization.",cs.CV,['cs.CV'] +Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing,Ling Lo · Cheng Yeo · Hong-Han Shuai · Wen-Huang Cheng, ,https://arxiv.org/abs/2402.02583,,,DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,"Large-scale Text-to-Image (T2I) diffusion models have revolutionized image +generation over the last few years. Although owning diverse and high-quality +generation capabilities, translating these abilities to fine-grained image +editing remains challenging. In this paper, we propose DiffEditor to rectify +two weaknesses in existing diffusion-based image editing: (1) in complex +scenarios, editing results often lack editing accuracy and exhibit unexpected +artifacts; (2) lack of flexibility to harmonize editing operations, e.g., +imagine new content. In our solution, we introduce image prompts in +fine-grained image editing, cooperating with the text prompt to better describe +the editing content. To increase the flexibility while maintaining content +consistency, we locally combine stochastic differential equation (SDE) into the +ordinary differential equation (ODE) sampling. In addition, we incorporate +regional score-based gradient guidance and a time travel strategy into the +diffusion sampling, further improving the editing quality. Extensive +experiments demonstrate that our method can efficiently achieve +state-of-the-art performance on various fine-grained image editing tasks, +including editing within a single image (e.g., object moving, resizing, and +content dragging) and across images (e.g., appearance replacing and object +pasting). Our source code is released at +https://github.com/MC-E/DragonDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +"Point, Segment and Count: A Generalized Framework for Object Counting",Zhizhong Huang · Mingliang Dai · Yi Zhang · Junping Zhang · Hongming Shan, ,https://arxiv.org/abs/2311.12386,,2311.12386.pdf,"Point, Segment and Count: A Generalized Framework for Object Counting","Class-agnostic object counting aims to count all objects in an image with +respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot +counting. In this paper, we propose a generalized framework for both few-shot +and zero-shot object counting based on detection. Our framework combines the +superior advantages of two foundation models without compromising their +zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask +proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate +object counts. However, this strategy meets the obstacles of efficiency +overhead and the small crowded objects that cannot be localized and +distinguished. To address these issues, our framework, termed PseCo, follows +three steps: point, segment, and count. Specifically, we first propose a +class-agnostic object localization to provide accurate but least point prompts +for SAM, which consequently not only reduces computation costs but also avoids +missing small objects. 
Furthermore, we propose a generalized object +classification that leverages CLIP image/text embeddings as the classifier, +following a hierarchical knowledge distillation to obtain discriminative +classifications among hierarchical mask proposals. Extensive experimental +results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves +state-of-the-art performance in both few-shot/zero-shot object +counting/detection. Code: https://github.com/Hzzone/PseCo",cs.CV,['cs.CV'] +On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm,Peng Sun · Bei Shi · Daiwei Yu · Tao Lin, ,https://arxiv.org/abs/2312.03526,,2312.03526.pdf,On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm,"Contemporary machine learning requires training large neural networks on +massive datasets and thus faces the challenges of high computational demands. +Dataset distillation, as a recent emerging strategy, aims to compress +real-world datasets for efficient training. However, this line of research +currently struggle with large-scale and high-resolution datasets, hindering its +practicality and feasibility. To this end, we re-examine the existing dataset +distillation methods and identify three properties required for large-scale +real-world applications, namely, realism, diversity, and efficiency. As a +remedy, we propose RDED, a novel computationally-efficient yet effective data +distillation paradigm, to enable both diversity and realism of the distilled +data. Extensive empirical results over various neural architectures and +datasets demonstrate the advancement of RDED: we can distill the full +ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, +achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU +(while the SOTA only achieves 21% but requires 6 hours).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces,Linyi Jin · Nilesh Kulkarni · David Fouhey,https://jinlinyi.github.io/3DFIRES/,https://arxiv.org/abs/2403.08768,,2403.08768.pdf,3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface,"This paper introduces 3DFIRES, a novel system for scene-level 3D +reconstruction from posed images. Designed to work with as few as one view, +3DFIRES reconstructs the complete geometry of unseen scenes, including hidden +surfaces. With multiple view inputs, our method produces full reconstruction +within all camera frustums. A key feature of our approach is the fusion of +multi-view information at the feature level, enabling the production of +coherent and comprehensive 3D reconstruction. We train our system on +non-watertight scans from large-scale real scene dataset. We show it matches +the efficacy of single-view reconstruction methods with only one input and +surpasses existing techniques in both quantitative and qualitative measures for +sparse-view 3D reconstruction.",cs.CV,['cs.CV'] +AlignMiF: Geometry-Aligned Multimodal Implicit Field for Enhanced LiDAR-Camera Joint Synthesis,Tao Tang · Guangrun Wang · Yixing Lao · Peng Chen · Jie Liu · Liang Lin · Kaicheng Yu · Xiaodan Liang, ,https://arxiv.org/abs/2402.17483,,2402.17483.pdf,AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis,"Neural implicit fields have been a de facto standard in novel view synthesis. 
+Recently, there exist some methods exploring fusing multiple modalities within +a single field, aiming to share implicit features from different modalities to +enhance reconstruction performance. However, these modalities often exhibit +misaligned behaviors: optimizing for one modality, such as LiDAR, can adversely +affect another, like camera performance, and vice versa. In this work, we +conduct comprehensive analyses on the multimodal implicit field of LiDAR-camera +joint synthesis, revealing the underlying issue lies in the misalignment of +different sensors. Furthermore, we introduce AlignMiF, a geometrically aligned +multimodal implicit field with two proposed modules: Geometry-Aware Alignment +(GAA) and Shared Geometry Initialization (SGI). These modules effectively align +the coarse geometry across different modalities, significantly enhancing the +fusion process between LiDAR and camera data. Through extensive experiments +across various datasets and scenes, we demonstrate the effectiveness of our +approach in facilitating better interaction between LiDAR and camera modalities +within a unified neural field. Specifically, our proposed AlignMiF, achieves +remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 +image PSNR on the KITTI-360 and Waymo datasets) and consistently surpasses +single modality performance (13.8% and 14.2% reduction in LiDAR Chamfer +Distance on the respective datasets).",cs.CV,['cs.CV'] +Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction,Zhenzhong Kuang · Xiaochen Yang · Yingjie Shen · Chao Hu · Jun Yu, ,https://arxiv.org/abs/2309.04228,,2309.04228.pdf,FIVA: Facial Image and Video Anonymization and Anonymization Defense,"In this paper, we present a new approach for facial anonymization in images +and videos, abbreviated as FIVA. Our proposed method is able to maintain the +same face anonymization consistently over frames with our suggested +identity-tracking and guarantees a strong difference from the original face. +FIVA allows for 0 true positives for a false acceptance rate of 0.001. Our work +considers the important security issue of reconstruction attacks and +investigates adversarial noise, uniform noise, and parameter noise to disrupt +reconstruction attacks. In this regard, we apply different defense and +protection methods against these privacy threats to demonstrate the scalability +of FIVA. On top of this, we also show that reconstruction attack models can be +used for detection of deep fakes. Last but not least, we provide experimental +results showing how FIVA can even enable face swapping, which is purely trained +on a single target image.",cs.CV,['cs.CV'] +EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation,Md Mostafijur Rahman · Mustafa Munir · Radu Marculescu,https://github.com/SLDGroup/EMCAD,https://arxiv.org/abs/2405.06880,,2405.06880.pdf,EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation,"An efficient and effective decoding mechanism is crucial in medical image +segmentation, especially in scenarios with limited computational resources. +However, these decoding mechanisms usually come with high computational costs. +To address this concern, we introduce EMCAD, a new efficient multi-scale +convolutional attention decoder, designed to optimize both performance and +computational efficiency. 
EMCAD leverages a unique multi-scale depth-wise +convolution block, significantly enhancing feature maps through multi-scale +convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) +gated attention mechanisms, which are highly effective at capturing intricate +spatial relationships while focusing on salient regions. By employing group and +depth-wise convolution, EMCAD is very efficient and scales well (e.g., only +1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). +Our rigorous evaluations across 12 datasets that belong to six medical image +segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) +performance with 79.4% and 80.3% reduction in #Params and #FLOPs, respectively. +Moreover, EMCAD's adaptability to different encoders and versatility across +segmentation tasks further establish EMCAD as a promising tool, advancing the +field towards more efficient and accurate medical image analysis. Our +implementation is available at https://github.com/SLDGroup/EMCAD.",eess.IV,"['eess.IV', 'cs.CV']" +UniDepth: Universal Monocular Metric Depth Estimation,Luigi Piccinelli · Yung-Hsu Yang · Christos Sakaridis · Mattia Segu · Siyuan Li · Luc Van Gool · Fisher Yu,https://github.com/lpiccinelli-eth/unidepth,https://arxiv.org/abs/2403.18913,,2403.18913.pdf,UniDepth: Universal Monocular Metric Depth Estimation,"Accurate monocular metric depth estimation (MMDE) is crucial to solving +downstream tasks in 3D perception and modeling. However, the remarkable +accuracy of recent MMDE methods is confined to their training domains. These +methods fail to generalize to unseen domains even in the presence of moderate +domain gaps, which hinders their practical applicability. We propose a new +model, UniDepth, capable of reconstructing metric 3D scenes from solely single +images across domains. Departing from the existing MMDE methods, UniDepth +directly predicts metric 3D points from the input image at inference time +without any additional information, striving for a universal and flexible MMDE +solution. In particular, UniDepth implements a self-promptable camera module +predicting dense camera representation to condition depth features. Our model +exploits a pseudo-spherical output representation, which disentangles camera +and depth representations. In addition, we propose a geometric invariance loss +that promotes the invariance of camera-prompted depth features. Thorough +evaluations on ten datasets in a zero-shot regime consistently demonstrate the +superior performance of UniDepth, even when compared with methods directly +trained on the testing domains. Code and models are available at: +https://github.com/lpiccinelli-eth/unidepth",cs.CV,['cs.CV'] +Learning from Synthetic Human Group Activities,Che-Jui Chang · Danrui Li · Deep Patel · Parth Goel · Seonghyeon Moon · Samuel Sohn · Honglu Zhou · Sejong Yoon · Vladimir Pavlovic · Mubbasir Kapadia,https://cjerry1243.github.io/M3Act/,https://arxiv.org/abs/2306.16772,,2306.16772.pdf,M3Act: Learning from Synthetic Human Group Activities,"The study of complex human interactions and group activities has become a +focal point in human-centric computer vision. However, progress in related +tasks is often hindered by the challenges of obtaining large-scale labeled +datasets from real-world scenarios. To address the limitation, we introduce +M3Act, a synthetic data generator for multi-view multi-group multi-person human +atomic actions and group activities. 
Powered by Unity Engine, M3Act features +multiple semantic groups, highly diverse and photorealistic images, and a +comprehensive set of annotations, which facilitates the learning of +human-centered tasks across single-person, multi-person, and multi-group +conditions. We demonstrate the advantages of M3Act across three core +experiments. The results suggest our synthetic dataset can significantly +improve the performance of several downstream methods and replace real-world +datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on +DanceTrack dataset, leading to a hop on the leaderboard from 10th to 2nd place. +Moreover, M3Act opens new research for controllable 3D group activity +generation. We define multiple metrics and propose a competitive baseline for +the novel task. Our code and data are available at our project page: +http://cjerry1243.github.io/M3Act.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory,飞 叶 · Adrian Bors, ,,https://ojs.aaai.org/index.php/AAAI/article/view/29582,,,,,nan +AMU-Tuning: Learning Effective Bias for CLIP-based Few-shot Classification,Yuwei Tang · ZhenYi Lin · Qilong Wang · Pengfei Zhu · Qinghua Hu, ,https://arxiv.org/abs/2404.08958,,2404.08958.pdf,AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning,"Recently, pre-trained vision-language models (e.g., CLIP) have shown great +potential in few-shot learning and attracted a lot of research interest. +Although efforts have been made to improve few-shot ability of CLIP, key +factors on the effectiveness of existing methods have not been well studied, +limiting further exploration of CLIP's potential in few-shot learning. In this +paper, we first introduce a unified formulation to analyze CLIP-based few-shot +learning methods from a perspective of logit bias, which encourages us to learn +an effective logit bias for further improving performance of CLIP-based +few-shot learning methods. To this end, we disassemble three key components +involved in computation of logit bias (i.e., logit features, logit predictor, +and logit fusion) and empirically analyze the effect on performance of few-shot +classification. Based on analysis of key components, this paper proposes a +novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot +classification. Specifically, our AMU-Tuning predicts logit bias by exploiting +the appropriate $\underline{\textbf{A}}$uxiliary features, which are fed into +an efficient feature-initialized linear classifier with +$\underline{\textbf{M}}$ulti-branch training. Finally, an +$\underline{\textbf{U}}$ncertainty-based fusion is developed to incorporate +logit bias into CLIP for few-shot classification. The experiments are conducted +on several widely used benchmarks, and the results show AMU-Tuning clearly +outperforms its counterparts while achieving state-of-the-art performance of +CLIP-based few-shot learning without bells and whistles.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +CoGS: Controllable Gaussian Splatting,Heng Yu · Joel Julin · Zoltán Á. Milacski · Koichiro Niinuma · László A. Jeni,https://cogs2024.github.io,https://arxiv.org/abs/2312.05664,,2312.05664.pdf,CoGS: Controllable Gaussian Splatting,"Capturing and re-animating the 3D structure of articulated objects present +significant barriers. On one hand, methods requiring extensively calibrated +multi-view setups are prohibitively complex and resource-intensive, limiting +their practical applicability. 
On the other hand, while single-camera Neural +Radiance Fields (NeRFs) offer a more streamlined approach, they have excessive +training and rendering costs. 3D Gaussian Splatting would be a suitable +alternative but for two reasons. Firstly, existing methods for 3D dynamic +Gaussians require synchronized multi-view cameras, and secondly, the lack of +controllability in dynamic scenarios. We present CoGS, a method for +Controllable Gaussian Splatting, that enables the direct manipulation of scene +elements, offering real-time control of dynamic scenes without the prerequisite +of pre-computing control signals. We evaluated CoGS using both synthetic and +real-world datasets that include dynamic objects that differ in degree of +difficulty. In our evaluations, CoGS consistently outperformed existing dynamic +and controllable neural representations in terms of visual fidelity.",cs.CV,['cs.CV'] +Neural Spline Fields for Burst Image Fusion and Layer Separation,Ilya Chugunov · David Shustin · Ruyu Yan · Chenyang Lei · Felix Heide, ,https://arxiv.org/abs/2312.14235,,2312.14235.pdf,Neural Spline Fields for Burst Image Fusion and Layer Separation,"Each photo in an image burst can be considered a sample of a complex 3D +scene: the product of parallax, diffuse and specular materials, scene motion, +and illuminant variation. While decomposing all of these effects from a stack +of misaligned images is a highly ill-conditioned task, the conventional +align-and-merge burst pipeline takes the other extreme: blending them into a +single image. In this work, we propose a versatile intermediate representation: +a two-layer alpha-composited image plus flow model constructed with neural +spline fields -- networks trained to map input coordinates to spline control +points. Our method is able to, during test-time optimization, jointly fuse a +burst image capture into one high-resolution reconstruction and decompose it +into transmission and obstruction layers. Then, by discarding the obstruction +layer, we can perform a range of tasks including seeing through occlusions, +reflection suppression, and shadow removal. Validated on complex synthetic and +in-the-wild captures we find that, with no post-processing steps or learned +priors, our generalizable model is able to outperform existing dedicated +single-image and multi-view obstruction removal approaches.",cs.CV,['cs.CV'] +Object Recognition as Next Token Prediction,Kaiyu Yue · Bor-Chun Chen · Jonas Geiping · Hengduo Li · Tom Goldstein · Ser-Nam Lim,https://github.com/kaiyuyue/nxtp,,https://www.semanticscholar.org/paper/Object-Recognition-as-Next-Token-Prediction-Yue-Chen/529a3164a4ef5c227b6a775f73936866cb51d72f,,,,,nan +GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models,Taoran Yi · Jiemin Fang · Junjie Wang · Guanjun Wu · Lingxi Xie · Xiaopeng Zhang · Wenyu Liu · Qi Tian · Xinggang Wang,https://taoranyi.com/gaussiandreamer/,https://arxiv.org/abs/2310.08529v3,,2310.08529v3.pdf,GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models,"In recent times, the generation of 3D assets from text prompts has shown +impressive results. Both 2D and 3D diffusion models can help generate decent 3D +objects based on prompts. 3D diffusion models have good 3D consistency, but +their quality and generalization are limited as trainable 3D data is expensive +and hard to obtain. 
2D diffusion models enjoy strong abilities of +generalization and fine generation, but 3D consistency is hard to guarantee. +This paper attempts to bridge the power from the two types of diffusion models +via the recent explicit and efficient 3D Gaussian splatting representation. A +fast 3D object generation framework, named as GaussianDreamer, is proposed, +where the 3D diffusion model provides priors for initialization and the 2D +diffusion model enriches the geometry and appearance. Operations of noisy point +growing and color perturbation are introduced to enhance the initialized +Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D +avatar within 15 minutes on one GPU, much faster than previous methods, while +the generated instances can be directly rendered in real time. Demos and code +are available at https://taoranyi.com/gaussiandreamer/.",cs.CV,"['cs.CV', 'cs.GR']" +APISR: Anime Production Inspired Real-World Anime Super-Resolution,Boyang Wang · Fengyu Yang · Xihang Yu · Chao Zhang · Hanbin Zhao, ,https://arxiv.org/abs/2403.01598,,2403.01598.pdf,APISR: Anime Production Inspired Real-World Anime Super-Resolution,"While real-world anime super-resolution (SR) has gained increasing attention +in the SR community, existing methods still adopt techniques from the +photorealistic domain. In this paper, we analyze the anime production workflow +and rethink how to use characteristics of it for the sake of the real-world +anime SR. First, we argue that video networks and datasets are not necessary +for anime SR due to the repetition use of hand-drawing frames. Instead, we +propose an anime image collection pipeline by choosing the least compressed and +the most informative frames from the video sources. Based on this pipeline, we +introduce the Anime Production-oriented Image (API) dataset. In addition, we +identify two anime-specific challenges of distorted and faint hand-drawn lines +and unwanted color artifacts. We address the first issue by introducing a +prediction-oriented compression module in the image degradation model and a +pseudo-ground truth preparation with enhanced hand-drawn lines. In addition, we +introduce the balanced twin perceptual loss combining both anime and +photorealistic high-level features to mitigate unwanted color artifacts and +increase visual clarity. We evaluate our method through extensive experiments +on the public benchmark, showing our method outperforms state-of-the-art anime +dataset-trained approaches.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation,Zhiwei Yang · Kexue Fu · Minghong Duan · Linhao Qu · Shuo Wang · Zhijian Song,https://github.com/zwyang6/SeCo,https://arxiv.org/abs/2402.18467,,2402.18467.pdf,Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation,"Weakly supervised semantic segmentation (WSSS) with image-level labels aims +to achieve segmentation tasks without dense annotations. However, attributed to +the frequent coupling of co-occurring objects and the limited supervision from +image-level labels, the challenging co-occurrence problem is widely present and +leads to false activation of objects in WSSS. In this work, we devise a +'Separate and Conquer' scheme SeCo to tackle this issue from dimensions of +image space and feature space. 
In the image space, we propose to 'separate' the +co-occurring objects with image decomposition by subdividing images into +patches. Importantly, we assign each patch a category tag from Class Activation +Maps (CAMs), which spatially helps remove the co-context bias and guide the +subsequent representation. In the feature space, we propose to 'conquer' the +false activation by enhancing semantic representation with multi-granularity +knowledge contrast. To this end, a dual-teacher-single-student architecture is +designed and tag-guided contrast is conducted, which guarantee the correctness +of knowledge and further facilitate the discrepancy among co-contexts. We +streamline the multi-staged WSSS pipeline end-to-end and tackle this issue +without external supervision. Extensive experiments are conducted, validating +the efficiency of our method and the superiority over previous single-staged +and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available +at https://github.com/zwyang6/SeCo.git.",cs.CV,['cs.CV'] +Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation,Sixian Zhang · Xinyao Yu · Xinhang Song · Xiaohan Wang · Shuqiang Jiang, ,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis,Dewei Zhou · You Li · Fan Ma · Xiaoting Zhang · Yi Yang, ,https://arxiv.org/abs/2402.05408,,2402.05408.pdf,MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis,"We present a Multi-Instance Generation (MIG) task, simultaneously generating +multiple instances with diverse controls in one image. Given a set of +predefined coordinates and their corresponding descriptions, the task is to +ensure that generated instances are accurately at the designated locations and +that all instances' attributes adhere to their corresponding description. This +broadens the scope of current research on Single-instance generation, elevating +it to a more versatile and practical dimension. Inspired by the idea of divide +and conquer, we introduce an innovative approach named Multi-Instance +Generation Controller (MIGC) to address the challenges of the MIG task. +Initially, we break down the MIG task into several subtasks, each involving the +shading of a single instance. To ensure precise shading for each instance, we +introduce an instance enhancement attention mechanism. Lastly, we aggregate all +the shaded instances to provide the necessary information for accurately +generating multiple instances in stable diffusion (SD). To evaluate how well +generation models perform on the MIG task, we provide a COCO-MIG benchmark +along with an evaluation pipeline. Extensive experiments were conducted on the +proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. +The evaluation results illustrate the exceptional control capabilities of our +model in terms of quantity, position, attribute, and interaction. Code and +demos will be released at https://migcproject.github.io/.",cs.CV,['cs.CV'] +Transfer CLIP for Generalizable Image Denoising,Jun Cheng · Dong Liang · Shan Tan,https://github.com/alwaysuu/CLIPDenoising,https://arxiv.org/abs/2403.15132,,,Transfer CLIP for Generalizable Image Denoising,"Image denoising is a fundamental task in computer vision. 
While prevailing +deep learning-based supervised and self-supervised methods have excelled in +eliminating in-distribution noise, their susceptibility to out-of-distribution +(OOD) noise remains a significant challenge. The recent emergence of +contrastive language-image pre-training (CLIP) model has showcased exceptional +capabilities in open-world image recognition and segmentation. Yet, the +potential for leveraging CLIP to enhance the robustness of low-level tasks +remains largely unexplored. This paper uncovers that certain dense features +extracted from the frozen ResNet image encoder of CLIP exhibit +distortion-invariant and content-related properties, which are highly desirable +for generalizable denoising. Leveraging these properties, we devise an +asymmetrical encoder-decoder denoising network, which incorporates dense +features including the noisy image and its multi-scale features from the frozen +ResNet encoder of CLIP into a learnable image decoder to achieve generalizable +denoising. The progressive feature augmentation strategy is further proposed to +mitigate feature overfitting and improve the robustness of the learnable +decoder. Extensive experiments and comparisons conducted across diverse OOD +noises, including synthetic noise, real-world sRGB noise, and low-dose CT image +noise, demonstrate the superior generalization ability of our method.",cs.CV,"['cs.CV', 'eess.IV']" +Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning,Yiwen Ye · Yutong Xie · Jianpeng Zhang · Ziyang Chen · Qi Wu · Yong Xia, ,https://arxiv.org/abs/2311.17597,,2311.17597.pdf,Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning,"Self-supervised learning is an efficient pre-training method for medical +image analysis. However, current research is mostly confined to +specific-modality data pre-training, consuming considerable time and resources +without achieving universality across different modalities. A straightforward +solution is combining all modality data for joint self-supervised pre-training, +which poses practical challenges. Firstly, our experiments reveal conflicts in +representation learning as the number of modalities increases. Secondly, +multi-modal data collected in advance cannot cover all real-world scenarios. In +this paper, we reconsider versatile self-supervised learning from the +perspective of continual learning and propose MedCoSS, a continuous +self-supervised learning approach for multi-modal medical data. Unlike joint +self-supervised learning, MedCoSS assigns different modality data to different +training stages, forming a multi-stage pre-training process. To balance modal +conflicts and prevent catastrophic forgetting, we propose a rehearsal-based +continual learning method. We introduce the k-means sampling strategy to retain +data from previous modalities and rehearse it when learning new modalities. +Instead of executing the pretext task on buffer data, a feature distillation +strategy and an intra-modal mixup strategy are applied to these data for +knowledge retention. We conduct continuous self-supervised pre-training on a +large-scale multi-modal unlabeled dataset, including clinical reports, X-rays, +CT scans, MRI scans, and pathological images. Experimental results demonstrate +MedCoSS's exceptional generalization ability across nine downstream datasets +and its significant scalability in integrating new modality data. 
Code and +pre-trained weight are available at https://github.com/yeerwen/MedCoSS.",cs.CV,['cs.CV'] +OmniVid: A Generative Framework for Universal Video Understanding,Junke Wang · Dongdong Chen · Chong Luo · Bo He · Lu Yuan · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2403.17935,,2403.17935.pdf,OmniVid: A Generative Framework for Universal Video Understanding,"The core of video understanding tasks, such as recognition, captioning, and +tracking, is to automatically detect objects or actions in a video and analyze +their temporal evolution. Despite sharing a common goal, different tasks often +rely on distinct model architectures and annotation formats. In contrast, +natural language processing benefits from a unified output space, i.e., text +sequences, which simplifies the training of powerful foundational language +models, such as GPT-3, with extensive training corpora. Inspired by this, we +seek to unify the output space of video understanding tasks by using languages +as labels and additionally introducing time and box tokens. In this way, a +variety of video tasks could be formulated as video-grounded token generation. +This enables us to address various types of video tasks, including +classification (such as action recognition), captioning (covering clip +captioning, video question answering, and dense video captioning), and +localization tasks (such as visual object tracking) within a fully shared +encoder-decoder architecture, following a generative framework. Through +comprehensive experiments, we demonstrate such a simple and straightforward +idea is quite effective and can achieve state-of-the-art or competitive results +on seven video benchmarks, providing a novel perspective for more universal +video understanding. Code is available at https://github.com/wangjk666/OmniVid.",cs.CV,['cs.CV'] +Learning from One Continuous Video Stream,Joao Carreira · Michael King · Viorica Patraucean · Dilara Gokay · Catalin Ionescu · Yi Yang · Daniel Zoran · Joseph Heyward · Carl Doersch · Yusuf Aytar · Dima Damen · Andrew Zisserman, ,https://arxiv.org/abs/2312.00598,,2312.00598.pdf,Learning from One Continuous Video Stream,"We introduce a framework for online learning from a single continuous video +stream -- the way people and animals learn, without mini-batches, data +augmentation or shuffling. This poses great challenges given the high +correlation between consecutive video frames and there is very little prior +work on it. Our framework allows us to do a first deep dive into the topic and +includes a collection of streams and tasks composed from two existing video +datasets, plus methodology for performance evaluation that considers both +adaptation and generalization. We employ pixel-to-pixel modelling as a +practical and flexible way to switch between pre-training and single-stream +evaluation as well as between arbitrary tasks, without ever requiring changes +to models and always using the same pixel loss. Equipped with this framework we +obtained large single-stream learning gains from pre-training with a novel +family of future prediction tasks, found that momentum hurts, and that the pace +of weight updates matters. 
The combination of these insights leads to matching +the performance of IID learning with batch size 1, when using the same +architecture and without costly replay buffers.",cs.CV,"['cs.CV', 'cs.AI']" +Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains,Bang-Dang Pham · Phong Tran · Anh Tran · Cuong Pham · Rang Nguyen · Minh Hoai,https://zero1778.github.io/blur2blur/,https://arxiv.org/abs/2403.16205,,2403.16205.pdf,Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains,"This paper presents an innovative framework designed to train an image +deblurring algorithm tailored to a specific camera device. This algorithm works +by transforming a blurry input image, which is challenging to deblur, into +another blurry image that is more amenable to deblurring. The transformation +process, from one blurry state to another, leverages unpaired data consisting +of sharp and blurry images captured by the target camera device. Learning this +blur-to-blur transformation is inherently simpler than direct blur-to-sharp +conversion, as it primarily involves modifying blur patterns rather than the +intricate task of reconstructing fine image details. The efficacy of the +proposed approach has been demonstrated through comprehensive experiments on +various benchmarks, where it significantly outperforms state-of-the-art methods +both quantitatively and qualitatively. Our code and data are available at +https://zero1778.github.io/blur2blur/",cs.CV,['cs.CV'] +Video Harmonization with Triplet Spatio-Temporal Variation Patterns,Zonghui Guo · XinYu Han · Jie Zhang · Shiguang Shan · Haiyong Zheng,https://github.com/zhenglab/VideoTripletTransformer,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval,Yi Xie · Yihong Lin · Wenjie Cai · Xuemiao Xu · Huaidong Zhang · Yong Du · Shengfeng He, ,https://arxiv.org/abs/2403.01431,,2403.01431.pdf,Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval,"The task of composed image retrieval (CIR) aims to retrieve images based on +the query image and the text describing the users' intent. Existing methods +have made great progress with the advanced large vision-language (VL) model in +CIR task, however, they generally suffer from two main issues: lack of labeled +triplets for model training and difficulty of deployment on resource-restricted +environments when deploying the large vision-language model. To tackle the +above problems, we propose Image2Sentence based Asymmetric zero-shot composed +image retrieval (ISA), which takes advantage of the VL model and only relies on +unlabeled images for composition learning. In the framework, we propose a new +adaptive token learner that maps an image to a sentence in the word embedding +space of VL model. The sentence adaptively captures discriminative visual +information and is further integrated with the text modifier. An asymmetric +structure is devised for flexible deployment, in which the lightweight model is +adopted for the query side while the large VL model is deployed on the gallery +side. The global contrastive distillation and the local alignment +regularization are adopted for the alignment between the light model and the VL +model for CIR task. 
Our experiments demonstrate that the proposed ISA could +better cope with the real retrieval scenarios and further improve retrieval +accuracy and efficiency.",cs.CV,['cs.CV'] +DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks,Jiaxin Zhang · Dezhi Peng · Chongyu Liu · Peirong Zhang · Lianwen Jin,https://github.com/ZZZHANG-jx/DocRes,https://arxiv.org/abs/2405.04408,,2405.04408.pdf,DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks,"Document image restoration is a crucial aspect of Document AI systems, as the +quality of document images significantly influences the overall performance. +Prevailing methods address distinct restoration tasks independently, leading to +intricate systems and the incapability to harness the potential synergies of +multi-task learning. To overcome this challenge, we propose DocRes, a +generalist model that unifies five document image restoration tasks including +dewarping, deshadowing, appearance enhancement, deblurring, and binarization. +To instruct DocRes to perform various restoration tasks, we propose a novel +visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The +DTSPrompt for different tasks comprises distinct prior features, which are +additional characteristics extracted from the input image. Beyond its role as a +cue for task-specific execution, DTSPrompt can also serve as supplementary +information to enhance the model's performance. Moreover, DTSPrompt is more +flexible than prior visual prompt approaches as it can be seamlessly applied +and adapted to inputs with high and variable resolutions. Experimental results +demonstrate that DocRes achieves competitive or superior performance compared +to existing state-of-the-art task-specific models. This underscores the +potential of DocRes across a broader spectrum of document image restoration +tasks. The source code is publicly available at +https://github.com/ZZZHANG-jx/DocRes",cs.CV,['cs.CV'] +Unified Entropy Optimization for Open-Set Test-Time Adaptation,Zhengqing Gao · Xu-Yao Zhang · Cheng-Lin Liu,https://github.com/gaozhengqing/UniEnt,https://arxiv.org/abs/2404.06065,,2404.06065.pdf,Unified Entropy Optimization for Open-Set Test-Time Adaptation,"Test-time adaptation (TTA) aims at adapting a model pre-trained on the +labeled source domain to the unlabeled target domain. Existing methods usually +focus on improving TTA performance under covariate shifts, while neglecting +semantic shifts. In this paper, we delve into a realistic open-set TTA setting +where the target domain may contain samples from unknown classes. Many +state-of-the-art closed-set TTA methods perform poorly when applied to open-set +scenarios, which can be attributed to the inaccurate estimation of data +distribution and model confidence. To address these issues, we propose a simple +but effective framework called unified entropy optimization (UniEnt), which is +capable of simultaneously adapting to covariate-shifted in-distribution (csID) +data and detecting covariate-shifted out-of-distribution (csOOD) data. +Specifically, UniEnt first mines pseudo-csID and pseudo-csOOD samples from test +data, followed by entropy minimization on the pseudo-csID data and entropy +maximization on the pseudo-csOOD data. Furthermore, we introduce UniEnt+ to +alleviate the noise caused by hard data partition leveraging sample-level +confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show +the superiority of our framework. 
The code is available at +https://github.com/gaozhengqing/UniEnt",cs.CV,['cs.CV'] +Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology,Wenhao Tang · Fengtao ZHOU · Sheng Huang · Xiang Zhu · Yi Zhang · Bo Liu, ,https://arxiv.org/abs/2402.17228,,2402.17228.pdf,Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology,"Multiple instance learning (MIL) is the most widely used framework in +computational pathology, encompassing sub-typing, diagnosis, prognosis, and +more. However, the existing MIL paradigm typically requires an offline instance +feature extractor, such as a pre-trained ResNet or a foundation model. This +approach lacks the capability for feature fine-tuning within the specific +downstream tasks, limiting its adaptability and performance. To address this +issue, we propose a Re-embedded Regional Transformer (R$^2$T) for re-embedding +the instance features online, which captures fine-grained local features and +establishes connections across different regions. Unlike existing works that +focus on pre-training powerful feature extractor or designing sophisticated +instance aggregator, R$^2$T is tailored to re-embed instance features online. +It serves as a portable module that can seamlessly integrate into mainstream +MIL models. Extensive experimental results on common computational pathology +tasks validate that: 1) feature re-embedding improves the performance of MIL +models based on ResNet-50 features to the level of foundation model features, +and further enhances the performance of foundation model features; 2) the +R$^2$T can introduce more significant performance improvements to various MIL +models; 3) R$^2$T-MIL, as an R$^2$T-enhanced AB-MIL, outperforms other latest +methods by a large margin.The code is available at: +https://github.com/DearCaat/RRT-MIL.",cs.CV,['cs.CV'] +Gradient-based Parameter Selection for Efficient Fine-Tuning,Zhi Zhang · Qizhe Zhang · Zijun Gao · Renrui Zhang · Ekaterina Shutova · Shiji Zhou · Shanghang Zhang, ,https://arxiv.org/abs/2312.10136,,2312.10136.pdf,Gradient-based Parameter Selection for Efficient Fine-Tuning,"With the growing size of pre-trained models, full fine-tuning and storing all +the parameters for various downstream tasks is costly and infeasible. In this +paper, we propose a new parameter-efficient fine-tuning method, Gradient-based +Parameter Selection (GPS), demonstrating that only tuning a few selected +parameters from the pre-trained model while keeping the remainder of the model +frozen can generate similar or better performance compared with the full model +fine-tuning method. Different from the existing popular and state-of-the-art +parameter-efficient fine-tuning approaches, our method does not introduce any +additional parameters and computational costs during both the training and +inference stages. Another advantage is the model-agnostic and non-destructive +property, which eliminates the need for any other design specific to a +particular model. Compared with the full fine-tuning, GPS achieves 3.33% +(91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) improvement of the +accuracy with tuning only 0.36% parameters of the pre-trained model on average +over 24 image classification tasks; it also demonstrates a significant +improvement of 17% and 16.8% in mDice and mIoU, respectively, on medical image +segmentation task. 
Moreover, GPS achieves state-of-the-art performance compared +with existing PEFT methods.",cs.CV,['cs.CV'] +UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model,Shuai Yuan · Lei Luo · Zhuo Hui · Can Pu · Xiaoyu Xiang · Rakesh Ranjan · Denis Demandolx, ,https://arxiv.org/abs/2405.02608,,2405.02608.pdf,UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model,"Traditional unsupervised optical flow methods are vulnerable to occlusions +and motion boundaries due to lack of object-level information. Therefore, we +propose UnSAMFlow, an unsupervised flow network that also leverages object +information from the latest foundation model Segment Anything Model (SAM). We +first include a self-supervised semantic augmentation module tailored to SAM +masks. We also analyze the poor gradient landscapes of traditional smoothness +losses and propose a new smoothness definition based on homography instead. A +simple yet effective mask feature module has also been added to further +aggregate features on the object level. With all these adaptations, our method +produces clear optical flow estimation with sharp boundaries around objects, +which outperforms state-of-the-art methods on both KITTI and Sintel datasets. +Our method also generalizes well across domains and runs very efficiently.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Diversity-aware Channel Pruning for StyleGAN Compression,Jiwoo Chung · Sangeek Hyun · Sang-Heon Shim · Jae-Pil Heo,https://jiwoogit.github.io/DCP-GAN_site/,https://arxiv.org/abs/2403.13548,,2403.13548.pdf,Diversity-aware Channel Pruning for StyleGAN Compression,"StyleGAN has shown remarkable performance in unconditional image generation. +However, its high computational cost poses a significant challenge for +practical applications. Although recent efforts have been made to compress +StyleGAN while preserving its performance, existing compressed models still lag +behind the original model, particularly in terms of sample diversity. To +overcome this, we propose a novel channel pruning method that leverages varying +sensitivities of channels to latent vectors, which is a key factor in sample +diversity. Specifically, by assessing channel importance based on their +sensitivities to latent vector perturbations, our method enhances the diversity +of samples in the compressed model. Since our method solely focuses on the +channel pruning stage, it has complementary benefits with prior training +schemes without additional training cost. Extensive experiments demonstrate +that our method significantly enhances sample diversity across various +datasets. Moreover, in terms of FID scores, our method not only surpasses +state-of-the-art by a large margin but also achieves comparable scores with +only half training iterations.",cs.CV,['cs.CV'] +Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning,Rashindrie Perera · Saman Halgamuge,https://github.com/rashindrie/DIPA,https://arxiv.org/abs/2403.04492,,2403.04492.pdf,Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning,"In this paper, we look at cross-domain few-shot classification which presents +the challenging task of learning new classes in previously unseen domains with +few labelled examples. Existing methods, though somewhat effective, encounter +several limitations, which we alleviate through two significant improvements. 
+First, we introduce a lightweight parameter-efficient adaptation strategy to +address overfitting associated with fine-tuning a large number of parameters on +small datasets. This strategy employs a linear transformation of pre-trained +features, significantly reducing the trainable parameter count. Second, we +replace the traditional nearest centroid classifier with a discriminative +sample-aware loss function, enhancing the model's sensitivity to the inter- and +intra-class variances within the training set for improved clustering in +feature space. Empirical evaluations on the Meta-Dataset benchmark showcase +that our approach not only improves accuracy up to 7.7\% and 5.3\% on +previously seen and unseen datasets, respectively, but also achieves the above +performance while being at least $\sim3\times$ more parameter-efficient than +existing methods, establishing a new state-of-the-art in cross-domain few-shot +learning. Our code is available at https://github.com/rashindrie/DIPA.",cs.CV,['cs.CV'] +FaceLift: Semi-supervised 3D Facial Landmark Localization,David Ferman · Pablo Garrido · Gaurav Bharaj,https://davidcferman.github.io/FaceLift/,https://arxiv.org/abs/2405.19646,,2405.19646.pdf,FaceLift: Semi-supervised 3D Facial Landmark Localization,"3D facial landmark localization has proven to be of particular use for +applications, such as face tracking, 3D face modeling, and image-based 3D face +reconstruction. In the supervised learning case, such methods usually rely on +3D landmark datasets derived from 3DMM-based registration that often lack +spatial definition alignment, as compared with that chosen by hand-labeled +human consensus, e.g., how are eyebrow landmarks defined? This creates a gap +between landmark datasets generated via high-quality 2D human labels and 3DMMs, +and it ultimately limits their effectiveness. To address this issue, we +introduce a novel semi-supervised learning approach that learns 3D landmarks by +directly lifting (visible) hand-labeled 2D landmarks and ensures better +definition alignment, without the need for 3D landmark datasets. To lift 2D +landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency +learning and in-the-wild multi-frame videos for robust cross-generalization. +Empirical experiments demonstrate that our method not only achieves better +definition alignment between 2D-3D landmarks but also outperforms other +supervised learning 3D landmark localization methods on both 3DMM labeled and +photogrammetric ground truth evaluation datasets. Project Page: +https://davidcferman.github.io/FaceLift",cs.CV,['cs.CV'] +MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models,Sanjoy Chowdhury · Sayan Nag · Joseph K J · Balaji Vasan Srinivasan · Dinesh Manocha, ,https://arxiv.org/abs/2310.13772,,2310.13772.pdf,TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models,"We present TexFusion (Texture Diffusion), a new method to synthesize textures +for given 3D geometries, using large-scale text-guided image diffusion models. +In contrast to recent works that leverage 2D text-to-image diffusion models to +distill 3D objects using a slow and fragile optimization process, TexFusion +introduces a new 3D-consistent generation technique specifically designed for +texture synthesis that employs regular diffusion model sampling on different 2D +rendered views. 
Specifically, we leverage latent diffusion models, apply the +diffusion model's denoiser on a set of 2D renders of the 3D object, and +aggregate the different denoising predictions on a shared latent texture map. +Final output RGB textures are produced by optimizing an intermediate neural +color field on the decodings of 2D renders of the latent texture. We thoroughly +validate TexFusion and show that we can efficiently generate diverse, high +quality and globally coherent textures. We achieve state-of-the-art text-guided +texture synthesis performance using only image diffusion models, while avoiding +the pitfalls of previous distillation-based methods. The text-conditioning +offers detailed control and we also do not rely on any ground truth 3D textures +for training. This makes our method versatile and applicable to a broad range +of geometry and texture types. We hope that TexFusion will advance AI-based +texturing of 3D assets for applications in virtual reality, game design, +simulation, and more.",cs.CV,"['cs.CV', 'cs.LG', 'I.3.3']" +Intensity-Robust Autofocus for Spike Camera,Changqing Su · Zhiyuan Ye · Yongsheng Xiao · You Zhou · Zhen Cheng · Bo Xiong · Zhaofei Yu · Tiejun Huang, ,https://arxiv.org/abs/2405.16790,,2405.16790.pdf,SCSim: A Realistic Spike Cameras Simulator,"Spike cameras, with their exceptional temporal resolution, are +revolutionizing high-speed visual applications. Large-scale synthetic datasets +have significantly accelerated the development of these cameras, particularly +in reconstruction and optical flow. However, current synthetic datasets for +spike cameras lack sophistication. Addressing this gap, we introduce SCSim, a +novel and more realistic spike camera simulator with a comprehensive noise +model. SCSim is adept at autonomously generating driving scenarios and +synthesizing corresponding spike streams. To enhance the fidelity of these +streams, we've developed a comprehensive noise model tailored to the unique +circuitry of spike cameras. Our evaluations demonstrate that SCSim outperforms +existing simulation methods in generating authentic spike streams. Crucially, +SCSim simplifies the creation of datasets, thereby greatly advancing +spike-based visual tasks like reconstruction. Our project refers to +https://github.com/Acnext/SCSim.",cs.CV,['cs.CV'] +SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields,"Quentin HERAU · Nathan Piasco · Moussab Bennehar · Luis Guillermo Roldao Jimenez · Dzmitry Tsishkou · Cyrille Migniot · Modélisation Information Systèmes · Cedric Demonceaux", ,https://arxiv.org/abs/2311.15803,,2311.15803.pdf,SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields,"In rapidly-evolving domains such as autonomous driving, the use of multiple +sensors with different modalities is crucial to ensure high operational +precision and stability. To correctly exploit the provided information by each +sensor in a single common frame, it is essential for these sensors to be +accurately calibrated. In this paper, we leverage the ability of Neural +Radiance Fields (NeRF) to represent different sensors modalities in a common +volumetric representation to achieve robust and accurate spatio-temporal sensor +calibration. By designing a partitioning approach based on the visible part of +the scene for each sensor, we formulate the calibration problem using only the +overlapping areas. This strategy results in a more robust and accurate +calibration that is less prone to failure.
We demonstrate that our approach +works on outdoor urban scenes by validating it on multiple established driving +datasets. Results show that our method is able to get better accuracy and +robustness compared to existing methods.",cs.CV,"['cs.CV', 'cs.RO']" +Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations,Lei Fan · Jianxiong Zhou · Xiaoying Xing · Ying Wu, ,https://arxiv.org/abs/2311.17938,,2311.17938.pdf,Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations,"Active recognition, which allows intelligent agents to explore observations +for better recognition performance, serves as a prerequisite for various +embodied AI tasks, such as grasping, navigation and room arrangements. Given +the evolving environment and the multitude of object classes, it is impractical +to include all possible classes during the training stage. In this paper, we +aim at advancing active open-vocabulary recognition, empowering embodied agents +to actively perceive and classify arbitrary objects. However, directly adopting +recent open-vocabulary classification models, like Contrastive Language Image +Pretraining (CLIP), poses its unique challenges. Specifically, we observe that +CLIP's performance is heavily affected by the viewpoint and occlusions, +compromising its reliability in unconstrained embodied perception scenarios. +Further, the sequential nature of observations in agent-environment +interactions necessitates an effective method for integrating features that +maintains discriminative strength for open-vocabulary classification. To +address these issues, we introduce a novel agent for active open-vocabulary +recognition. The proposed method leverages inter-frame and inter-concept +similarities to navigate agent movements and to fuse features, without relying +on class-specific knowledge. Compared to baseline CLIP model with 29.6% +accuracy on ShapeNet dataset, the proposed agent could achieve 53.3% accuracy +for open-vocabulary recognition, without any fine-tuning to the equipped CLIP +model. Additional experiments conducted with the Habitat simulator further +affirm the efficacy of our method.",cs.CV,['cs.CV'] +2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images,Junkai Deng · Fei Hou · Xuhui Chen · Wencheng Wang · Ying He, ,https://arxiv.org/abs/2308.09302,,2308.09302.pdf,Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,"Robust audio anti-spoofing has been increasingly challenging due to the +recent advancements on deepfake techniques. While spectrograms have +demonstrated their capability for anti-spoofing, complementary information +presented in multi-order spectral patterns have not been well explored, which +limits their effectiveness for varying spoofing attacks. Therefore, we propose +a novel deep learning method with a spectral fusion-reconstruction strategy, +namely S2pecNet, to utilise multi-order spectral patterns for robust audio +anti-spoofing representations. Specifically, spectral patterns up to +second-order are fused in a coarse-to-fine manner and two branches are designed +for the fine-level fusion from the spectral and temporal contexts. A +reconstruction from the fused representation to the input spectrograms further +reduces the potential fused information loss. 
Our method achieved the +state-of-the-art performance with an EER of 0.77% on a widely used dataset: +ASVspoof2019 LA Challenge.",cs.SD,"['cs.SD', 'cs.AI', 'cs.MM', 'eess.AS']" +Sheared Backpropagation for Finetuning Foundation Models,Zhiyuan Yu · Li Shen · Liang Ding · Xinmei Tian · Yixin Chen · Dacheng Tao, ,https://arxiv.org/abs/2402.15017,,2402.15017.pdf,Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning,"Foundation models have emerged as a powerful tool for many AI problems. +Despite the tremendous success of foundation models, effective adaptation to +new tasks, particularly those with limited labels, remains an open question and +lacks theoretical understanding. An emerging solution with recent success in +vision and NLP involves finetuning a foundation model on a selection of +relevant tasks, before its adaptation to a target task with limited labeled +samples. In this paper, we study the theoretical justification of this +multitask finetuning approach. Our theoretical analysis reveals that with a +diverse set of related tasks, this multitask finetuning leads to reduced error +in the target task, in comparison to directly adapting the same pretrained +model. We quantify the relationship between finetuning tasks and target tasks +by diversity and consistency metrics, and further propose a practical task +selection algorithm. We substantiate our theoretical claims with extensive +empirical evidence. Further, we present results affirming our task selection +algorithm adeptly chooses related finetuning tasks, providing advantages to the +model performance on target tasks. We believe our study shed new light on the +effective adaptation of foundation models to new tasks that lack abundant +labels. Our code is available at +https://github.com/OliverXUZY/Foudation-Model_Multitask.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL']" +DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,Tom Van Wouwe · Seunghwan Lee · Antoine Falisse · Scott Delp · Karen Liu,https://diffusionposer.github.io/,https://arxiv.org/abs/2308.16682,,,DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,"Motion capture from a limited number of body-worn sensors, such as inertial +measurement units (IMUs) and pressure insoles, has important applications in +health, human performance, and entertainment. Recent work has focused on +accurately reconstructing whole-body motion from a specific sensor +configuration using six IMUs. While a common goal across applications is to use +the minimal number of sensors to achieve required accuracy, the optimal +arrangement of the sensors might differ from application to application. We +propose a single diffusion model, DiffusionPoser, which reconstructs human +motion in real-time from an arbitrary combination of sensors, including IMUs +placed at specified locations, and, pressure insoles. Unlike existing methods, +our model grants users the flexibility to determine the number and arrangement +of sensors tailored to the specific activity of interest, without the need for +retraining. A novel autoregressive inferencing scheme ensures real-time motion +reconstruction that closely aligns with measured sensor signals. The generative +nature of DiffusionPoser ensures realistic behavior, even for +degrees-of-freedom not directly measured. 
Qualitative results can be found on +our website: https://diffusionposer.github.io/.",cs.CV,['cs.CV'] +DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning,Haoran Xu · Peixi Peng · Guang Tan · Yuan Li · Xinhai Xu · Yonghong Tian,https://github.com/kyoran/DMR,,https://link.springer.com/article/10.1007/s11704-023-2444-y,,,,,nan +Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions,Runhao Zeng · Xiaoyong Chen · Jiaming Liang · Huisi Wu · Guang-Zhong Cao · Yong Guo, ,https://arxiv.org/abs/2403.20254,,2403.20254.pdf,Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions,"Temporal action detection (TAD) aims to locate action positions and recognize +action categories in long-term untrimmed videos. Although many methods have +achieved promising results, their robustness has not been thoroughly studied. +In practice, we observe that temporal information in videos can be occasionally +corrupted, such as missing or blurred frames. Interestingly, existing methods +often incur a significant performance drop even if only one frame is affected. +To formally evaluate the robustness, we establish two temporal corruption +robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, +we extensively analyze the robustness of seven leading TAD methods and obtain +some interesting findings: 1) Existing methods are particularly vulnerable to +temporal corruptions, and end-to-end methods are often more susceptible than +those with a pre-trained feature extractor; 2) Vulnerability mainly comes from +localization error rather than classification error; 3) When corruptions occur +in the middle of an action instance, TAD models tend to yield the largest +performance drop. Besides building a benchmark, we further develop a simple but +effective robust training method to defend against temporal corruptions, +through the FrameDrop augmentation and Temporal-Robust Consistency loss. +Remarkably, our approach not only improves robustness but also yields promising +improvements on clean data. We believe that this study will serve as a +benchmark for future research in robust video analysis. Source code and models +are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.",cs.CV,['cs.CV'] +Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera,Jiye Lee · Hanbyul Joo,https://jiyewise.github.io/projects/MocapEvery,https://arxiv.org/abs/2401.00847,,2401.00847.pdf,Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera,"We present a lightweight and affordable motion capture method based on two +smartwatches and a head-mounted camera. In contrast to the existing approaches +that use six or more expert-level IMU devices, our approach is much more +cost-effective and convenient. Our method can make wearable motion capture +accessible to everyone everywhere, enabling 3D full-body motion capture in +diverse environments. As a key idea to overcome the extreme sparsity and +ambiguities of sensor inputs with different modalities, we integrate 6D head +poses obtained from the head-mounted cameras for motion estimation. To enable +capture in expansive indoor and outdoor scenes, we propose an algorithm to +track and update floor level changes to define head poses, coupled with a +multi-stage Transformer-based regression module. 
We also introduce novel +strategies leveraging visual cues of egocentric images to further enhance the +motion capture quality while reducing ambiguities. We demonstrate the +performance of our method on various challenging scenarios, including complex +outdoor environments and everyday motions including object interactions and +social interactions among multiple individuals.",cs.CV,"['cs.CV', 'cs.GR']" +SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design,Seokju Yun · Youngmin Ro,https://github.com/ysj9909/SHViT,https://arxiv.org/abs/2401.16456,,2401.16456.pdf,SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design,"Recently, efficient Vision Transformers have shown great performance with low +latency on resource-constrained devices. Conventionally, they use 4x4 patch +embeddings and a 4-stage structure at the macro level, while utilizing +sophisticated attention with multi-head configuration at the micro level. This +paper aims to address computational redundancy at all design levels in a +memory-efficient manner. We discover that using larger-stride patchify stem not +only reduces memory access costs but also achieves competitive performance by +leveraging token representations with reduced spatial redundancy from the early +stages. Furthermore, our preliminary analyses suggest that attention layers in +the early stages can be substituted with convolutions, and several attention +heads in the latter stages are computationally redundant. To handle this, we +introduce a single-head attention module that inherently prevents head +redundancy and simultaneously boosts accuracy by parallelly combining global +and local information. Building upon our solutions, we introduce SHViT, a +Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy +tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x +faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, +respectively, while being 1.3% more accurate. For object detection and instance +segmentation on MS COCO using Mask-RCNN head, our model achieves performance +comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone +latency on GPU and mobile device, respectively.",cs.CV,['cs.CV'] +Improved Self-Training for Test-Time Adaptation,Jing Ma, ,https://arxiv.org/abs/2309.14949v1,,2309.14949v1.pdf,Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization,"Test-Time Adaptation aims to adapt source domain model to testing data at +inference stage with success demonstrated in adapting to unseen corruptions. +However, these attempts may fail under more challenging real-world scenarios. +Existing works mainly consider real-world test-time adaptation under non-i.i.d. +data stream and continual domain shift. In this work, we first complement the +existing real-world TTA protocol with a globally class imbalanced testing set. +We demonstrate that combining all settings together poses new challenges to +existing methods. We argue the failure of state-of-the-art methods is first +caused by indiscriminately adapting normalization layers to imbalanced testing +data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap +out the regular batchnorm at inference stage. The new batchnorm layer is +capable of adapting without biasing towards majority classes. We are further +inspired by the success of self-training~(ST) in learning from unlabeled data +and adapt ST for test-time adaptation. 
However, ST alone is prone to over +adaption which is responsible for the poor performance under continual domain +shift. Hence, we propose to improve self-training under continual domain shift +by regularizing model updates with an anchored loss. The final TTA model, +termed as TRIBE, is built upon a tri-net architecture with balanced batchnorm +layers. We evaluate TRIBE on four datasets representing real-world TTA +settings. TRIBE consistently achieves the state-of-the-art performance across +multiple evaluation protocols. The code is available at +\url{https://github.com/Gorilla-Lab-SCUT/TRIBE}.",cs.LG,"['cs.LG', 'cs.CV']" +APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation,Weizhao He · Yang Zhang · Wei Zhuo · Linlin Shen · Jiaqi Yang · Songhe Deng · Liang Sun, ,https://arxiv.org/abs/2405.15265,,2405.15265.pdf,Cross-Domain Few-Shot Semantic Segmentation via Doubly Matching Transformation,"Cross-Domain Few-shot Semantic Segmentation (CD-FSS) aims to train +generalized models that can segment classes from different domains with a few +labeled images. Previous works have proven the effectiveness of feature +transformation in addressing CD-FSS. However, they completely rely on support +images for feature transformation, and repeatedly utilizing a few support +images for each class may easily lead to overfitting and overlooking +intra-class appearance differences. In this paper, we propose a Doubly Matching +Transformation-based Network (DMTNet) to solve the above issue. Instead of +completely relying on support images, we propose Self-Matching Transformation +(SMT) to construct query-specific transformation matrices based on query images +themselves to transform domain-specific query features into domain-agnostic +ones. Calculating query-specific transformation matrices can prevent +overfitting, especially for the meta-testing stage where only one or several +images are used as support images to segment hundreds or thousands of images. +After obtaining domain-agnostic features, we exploit a Dual Hypercorrelation +Construction (DHC) module to explore the hypercorrelations between the query +image with the foreground and background of the support image, based on which +foreground and background prediction maps are generated and supervised, +respectively, to enhance the segmentation result. In addition, we propose a +Test-time Self-Finetuning (TSF) strategy to more accurately self-tune the query +prediction in unseen domains. Extensive experiments on four popular datasets +show that DMTNet achieves superior performance over state-of-the-art +approaches. Code is available at https://github.com/ChenJiayi68/DMTNet.",cs.CV,['cs.CV'] +CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation,Bo-Yuan Sun · Yuqi Yang · Le Zhang · Ming-Ming Cheng · Qibin Hou,https://github.com/BBBBchan/CorrMatch,https://arxiv.org/abs/2306.04300v3,,2306.04300v3.pdf,CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation,"This paper presents a simple but performant semi-supervised semantic +segmentation approach, called CorrMatch. Previous approaches mostly employ +complicated training strategies to leverage unlabeled data but overlook the +role of correlation maps in modeling the relationships between pairs of +locations. We observe that the correlation maps not only enable clustering +pixels of the same category easily but also contain good shape information, +which previous works have omitted. 
Motivated by these, we aim to improve the +use efficiency of unlabeled data by designing two novel label propagation +strategies. First, we propose to conduct pixel propagation by modeling the +pairwise similarities of pixels to spread the high-confidence pixels and dig +out more. Then, we perform region propagation to enhance the pseudo labels with +accurate class-agnostic masks extracted from the correlation maps. CorrMatch +achieves great performance on popular segmentation benchmarks. Taking the +DeepLabV3+ with ResNet-101 backbone as our segmentation model, we receive a +76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. +Code is available at https://github.com/BBBBchan/CorrMatch.",cs.CV,['cs.CV'] +PixelLM: Pixel Reasoning with Large Multimodal Model,Zhongwei Ren · Zhicheng Huang · Yunchao Wei · Yao Zhao · Dongmei Fu · Jiashi Feng · Xiaojie Jin, ,https://arxiv.org/abs/2312.02228,,2312.02228.pdf,PixelLM: Pixel Reasoning with Large Multimodal Model,"While large multimodal models (LMMs) have achieved remarkable progress, +generating pixel-level masks for image reasoning tasks involving multiple +open-world targets remains a challenge. To bridge this gap, we introduce +PixelLM, an effective and efficient LMM for pixel-level reasoning and +understanding. Central to PixelLM is a novel, lightweight pixel decoder and a +comprehensive segmentation codebook. The decoder efficiently produces masks +from the hidden embeddings of the codebook tokens, which encode detailed +target-relevant information. With this design, PixelLM harmonizes with the +structure of popular LMMs and avoids the need for additional costly +segmentation models. Furthermore, we propose a target refinement loss to +enhance the model's ability to differentiate between multiple targets, leading +to substantially improved mask quality. To advance research in this area, we +construct MUSE, a high-quality multi-target reasoning segmentation benchmark. +PixelLM excels across various pixel-level image reasoning and understanding +tasks, outperforming well-established methods in multiple benchmarks, including +MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm +the efficacy of each proposed component. All code, models, and datasets will be +publicly available.",cs.CV,['cs.CV'] +EGTR: Extracting Graph from Transformer for Scene Graph Generation,Jinbae Im · JeongYeon Nam · Nokyung Park · Hyungmin Lee · Seunghyun Park,https://github.com/naver-ai/egtr,https://arxiv.org/abs/2404.02072,,2404.02072.pdf,EGTR: Extracting Graph from Transformer for Scene Graph Generation,"Scene Graph Generation (SGG) is a challenging task of detecting objects and +predicting relationships between objects. After DETR was developed, one-stage +SGG models based on a one-stage object detector have been actively studied. +However, complex modeling is used to predict the relationship between objects, +and the inherent relationship between object queries learned in the multi-head +self-attention of the object detector has been neglected. We propose a +lightweight one-stage SGG model that extracts the relation graph from the +various relationships learned in the multi-head self-attention layers of the +DETR decoder. By fully utilizing the self-attention by-products, the relation +graph can be extracted effectively with a shallow relation extraction head. 
+Considering the dependency of the relation extraction task on the object +detection task, we propose a novel relation smoothing technique that adjusts +the relation label adaptively according to the quality of the detected objects. +By the relation smoothing, the model is trained according to the continuous +curriculum that focuses on object detection task at the beginning of training +and performs multi-task learning as the object detection performance gradually +improves. Furthermore, we propose a connectivity prediction task that predicts +whether a relation exists between object pairs as an auxiliary task of the +relation extraction. We demonstrate the effectiveness and efficiency of our +method for the Visual Genome and Open Image V6 datasets. Our code is publicly +available at https://github.com/naver-ai/egtr.",cs.CV,"['cs.CV', 'cs.LG']" +Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning,Rongjie Li · Yu Wu · Xuming He, ,https://arxiv.org/abs/2404.00909v1,,2404.00909v1.pdf,Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning,"Generative vision-language models (VLMs) have shown impressive performance in +zero-shot vision-language tasks like image captioning and visual question +answering. However, improving their zero-shot reasoning typically requires +second-stage instruction tuning, which relies heavily on human-labeled or large +language model-generated annotation, incurring high labeling costs. To tackle +this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a +novel pre-training task designed to enhance VLMs' zero-shot performance without +the need for labeled task-aware data. The ICCC task compels VLMs to rectify +mismatches between visual and language concepts, thereby enhancing instruction +following and text generation conditioned on visual inputs. Leveraging language +structure and a lightweight dependency parser, we construct data samples of +ICCC task from image-text datasets with low labeling and computation costs. +Experimental results on BLIP-2 and InstructBLIP demonstrate significant +improvements in zero-shot image-text generation-based VL tasks through ICCC +instruction tuning.",cs.CV,['cs.CV'] +GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs,Mustafa Munir · William Avery · Md Mostafijur Rahman · Radu Marculescu, ,https://arxiv.org/abs/2405.06849,,2405.06849.pdf,GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs,"Vision graph neural networks (ViG) offer a new avenue for exploration in +computer vision. A major bottleneck in ViGs is the inefficient k-nearest +neighbor (KNN) operation used for graph construction. To solve this issue, we +propose a new method for designing ViGs, Dynamic Axial Graph Construction +(DAGC), which is more efficient than KNN as it limits the number of considered +graph connections made within an image. Additionally, we propose a novel +CNN-GNN architecture, GreedyViG, which uses DAGC. Extensive experiments show +that GreedyViG beats existing ViG, CNN, and ViT architectures in terms of +accuracy, GMACs, and parameters on image classification, object detection, +instance segmentation, and semantic segmentation tasks. Our smallest model, +GreedyViG-S, achieves 81.1% top-1 accuracy on ImageNet-1K, 2.9% higher than +Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN), with +less GMACs and a similar number of parameters. 
Our largest model, GreedyViG-B +obtains 83.9% top-1 accuracy, 0.2% higher than Vision GNN, with a 66.6% +decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains +the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% +decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only +provide a new avenue for designing efficient models, but that they can also +exceed the performance of current state-of-the-art models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods,Mingqi Jiang · Saeed Khorram · Li Fuxin,https://mingqij.github.io/projects/cdmmtc,,https://www.nature.com/articles/s41598-024-59384-x,,,,,nan +LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,Kibum Kim · Kanghoon Yoon · Jaehyeong Jeon · Yeonjun In · Jinyoung Moon · Donghyun Kim · Chanyoung Park, ,https://arxiv.org/abs/2310.10404,,2310.10404.pdf,LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,"Weakly-Supervised Scene Graph Generation (WSSGG) research has recently +emerged as an alternative to the fully-supervised approach that heavily relies +on costly annotations. In this regard, studies on WSSGG have utilized image +captions to obtain unlocalized triplets while primarily focusing on grounding +the unlocalized triplets over image regions. However, they have overlooked the +two issues involved in the triplet formation process from the captions: 1) +Semantic over-simplification issue arises when extracting triplets from +captions, where fine-grained predicates in captions are undesirably converted +into coarse-grained predicates, resulting in a long-tailed predicate +distribution, and 2) Low-density scene graph issue arises when aligning the +triplets in the caption with entity/predicate classes of interest, where many +triplets are discarded and not used in training, leading to insufficient +supervision. To tackle the two issues, we propose a new approach, i.e., Large +Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two +issues by leveraging the LLM's in-depth understanding of language and reasoning +ability during the extraction of triplets from captions and alignment of +entity/predicate classes with target data. To further engage the LLM in these +processes, we adopt the idea of Chain-of-Thought and the in-context few-shot +learning strategy. To validate the effectiveness of LLM4SGG, we conduct +extensive experiments on Visual Genome and GQA datasets, showing significant +improvements in both Recall@K and mean Recall@K compared to the +state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is +data-efficient, enabling effective model training with a small amount of +training images.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DUSt3R: Geometric 3D Vision Made Easy,Shuzhe Wang · Vincent Leroy · Yohann Cabon · Boris Chidlovskii · Jerome Revaud, ,https://arxiv.org/abs/2312.14132v1,,2312.14132v1.pdf,DUSt3R: Geometric 3D Vision Made Easy,"Multi-view stereo reconstruction (MVS) in the wild requires to first estimate +the camera parameters e.g. intrinsic and extrinsic parameters. These are +usually tedious and cumbersome to obtain, yet they are mandatory to triangulate +corresponding pixels in 3D space, which is the core of all best performing MVS +algorithms. 
In this work, we take an opposite stance and introduce DUSt3R, a +radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction +of arbitrary image collections, i.e. operating without prior information about +camera calibration nor viewpoint poses. We cast the pairwise reconstruction +problem as a regression of pointmaps, relaxing the hard constraints of usual +projective camera models. We show that this formulation smoothly unifies the +monocular and binocular reconstruction cases. In the case where more than two +images are provided, we further propose a simple yet effective global alignment +strategy that expresses all pairwise pointmaps in a common reference frame. We +base our network architecture on standard Transformer encoders and decoders, +allowing us to leverage powerful pretrained models. Our formulation directly +provides a 3D model of the scene as well as depth information, but +interestingly, we can seamlessly recover from it, pixel matches, relative and +absolute camera. Exhaustive experiments on all these tasks showcase that the +proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on +monocular/multi-view depth estimation as well as relative pose estimation. In +summary, DUSt3R makes many geometric 3D vision tasks easy.",cs.CV,['cs.CV'] +Latent Modulated Function for Computational Optimal Continuous Image Representation,Zongyao He · Zhi Jin,https://github.com/HeZongyao/LMF,https://arxiv.org/abs/2404.16451,,2404.16451.pdf,Latent Modulated Function for Computational Optimal Continuous Image Representation,"The recent work Local Implicit Image Function (LIIF) and subsequent Implicit +Neural Representation (INR) based works have achieved remarkable success in +Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution +(LR) features. However, these continuous image representations typically +implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading +to a quadratic increase in computational cost and seriously hindering the +practical applications of ASSR. To tackle this problem, we propose a novel +Latent Modulated Function (LMF), which decouples the HR-HD decoding process +into shared latent decoding in LR-HD space and independent rendering in HR +Low-Dimensional (LD) space, thereby realizing the first computational optimal +paradigm of continuous image representation. Specifically, LMF utilizes an HD +MLP in latent space to generate latent modulations of each LR feature vector. +This enables a modulated LD MLP in render space to quickly adapt to any input +feature vector and perform rendering at arbitrary resolution. Furthermore, we +leverage the positive correlation between modulation intensity and input image +complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, +offering the flexibility to adjust the decoding efficiency based on the +rendering precision. Extensive experiments demonstrate that converting existing +INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, +accelerate inference by up to 57 times, and save up to 76% of parameters, while +maintaining competitive performance. 
The code is available at +https://github.com/HeZongyao/LMF.",cs.CV,"['cs.CV', 'cs.AI']" +Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,Zhihao Yuan · Jinke Ren · Chun-Mei Feng · Hengshuang Zhao · Shuguang Cui · Zhen Li,https://curryyuan.github.io/ZSVG3D/,https://arxiv.org/abs/2311.15383,,,Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,"3D Visual Grounding (3DVG) aims at localizing 3D object based on textual +descriptions. Conventional supervised methods for 3DVG often necessitate +extensive annotations and a predefined vocabulary, which can be restrictive. To +address this issue, we propose a novel visual programming approach for +zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language +models (LLMs). Our approach begins with a unique dialog-based method, engaging +with LLMs to establish a foundational understanding of zero-shot 3DVG. Building +on this, we design a visual program that consists of three types of modules, +i.e., view-independent, view-dependent, and functional modules. These modules, +specifically tailored for 3D scenarios, work collaboratively to perform complex +reasoning and inference. Furthermore, we develop an innovative language-object +correlation module to extend the scope of existing 3D object detectors into +open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot +approach can outperform some supervised baselines, marking a significant stride +towards effective 3DVG.",cs.CV,['cs.CV'] +NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning,Mustafa B Gurbuz · Jean Moorman · Constantine Dovrolis,https://github.com/BurakGurbuz97/NICE,https://arxiv.org/abs/2310.03898,,2310.03898.pdf,Class-Incremental Learning Using Generative Experience Replay Based on Time-aware Regularization,"Learning new tasks accumulatively without forgetting remains a critical +challenge in continual learning. Generative experience replay addresses this +challenge by synthesizing pseudo-data points for past learned tasks and later +replaying them for concurrent training along with the new tasks' data. +Generative replay is the best strategy for continual learning under a strict +class-incremental setting when certain constraints need to be met: (i) constant +model size, (ii) no pre-training dataset, and (iii) no memory buffer for +storing past tasks' data. Inspired by the biological nervous system mechanisms, +we introduce a time-aware regularization method to dynamically fine-tune the +three training objective terms used for generative replay: supervised learning, +latent regularization, and data reconstruction. Experimental results on major +benchmarks indicate that our method pushes the limit of brain-inspired +continual learners under such strict settings, improves memory retention, and +increases the average performance over continually arriving tasks.",cs.LG,['cs.LG'] +A Simple Recipe for Language-guided Domain Generalized Segmentation,Mohammad Fahes · TUAN-HUNG VU · Andrei Bursuc · Patrick Pérez · Raoul de Charette,https://astra-vision.github.io/FAMix/,https://arxiv.org/abs/2311.17922,,2311.17922.pdf,A Simple Recipe for Language-guided Domain Generalized Segmentation,"Generalization to new domains not seen during training is one of the +long-standing challenges in deploying neural networks in real-world +applications. 
Existing generalization techniques either necessitate external +images for augmentation, and/or aim at learning invariant representations by +imposing various alignment constraints. Large-scale pretraining has recently +shown promising generalization capabilities, along with the potential of +binding different modalities. For instance, the advent of vision-language +models like CLIP has opened the doorway for vision models to exploit the +textual modality. In this paper, we introduce a simple framework for +generalizing semantic segmentation networks by employing language as the source +of randomization. Our recipe comprises three key ingredients: (i) the +preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) +language-driven local style augmentation, and (iii) randomization by locally +mixing the source and augmented styles during training. Extensive experiments +report state-of-the-art results on various generalization benchmarks. Code is +accessible at https://github.com/astra-vision/FAMix .",cs.CV,['cs.CV'] +Self-Calibrating Vicinal Risk Minimisation for Model Calibration,Jiawei Liu · Changkun Ye · Ruikai Cui · Nick Barnes, ,https://arxiv.org/abs/2307.13539,,2307.13539.pdf,Model Calibration in Dense Classification with Adaptive Label Perturbation,"For safety-related applications, it is crucial to produce trustworthy deep +neural networks whose prediction is associated with confidence that can +represent the likelihood of correctness for subsequent decision-making. +Existing dense binary classification models are prone to being over-confident. +To improve model calibration, we propose Adaptive Stochastic Label Perturbation +(ASLP) which learns a unique label perturbation level for each training image. +ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, +which unifies label perturbation processes including stochastic approaches +(like DisturbLabel), and label smoothing, to correct calibration while +maintaining classification rates. ASLP follows Maximum Entropy Inference of +classic statistical mechanics to maximise prediction entropy with respect to +missing information. It performs this while: (1) preserving classification +accuracy on known data as a conservative solution, or (2) specifically improves +model calibration degree by minimising the gap between the prediction accuracy +and expected confidence of the target training label. Extensive results +demonstrate that ASLP can significantly improve calibration degrees of dense +binary classification models on both in-distribution and out-of-distribution +data. The code is available on https://github.com/Carlisle-Liu/ASLP.",cs.CV,"['cs.CV', 'cs.LG']" +Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation,Junyan Wang · Zhenhong Sun · Stewart Tan · Xuanbai Chen · Weihua Chen · li · Cheng Zhang · Yang Song, ,https://arxiv.org/abs/2403.05239,,2403.05239.pdf,Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation,"Vanilla text-to-image diffusion models struggle with generating accurate +human images, commonly resulting in imperfect anatomies such as unnatural +postures or disproportionate limbs.Existing methods address this issue mostly +by fine-tuning the model with extra images or adding additional controls -- +human-centric priors such as pose or depth maps -- during the image generation +phase. 
This paper explores the integration of these human-centric priors +directly into the model fine-tuning stage, essentially eliminating the need for +extra conditions at the inference stage. We realize this idea by proposing a +human-centric alignment loss to strengthen human-related information from the +textual prompts within the cross-attention maps. To ensure semantic detail +richness and human structural accuracy during fine-tuning, we introduce +scale-aware and step-wise constraints within the diffusion process, according +to an in-depth analysis of the cross-attention layer. Extensive experiments +show that our method largely improves over state-of-the-art text-to-image +models to synthesize high-quality human images based on user-written prompts. +Project page: \url{https://hcplayercvpr2024.github.io}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability,Jaehui Hwang · Junghyuk Lee · Jong-Seok Lee, ,https://arxiv.org/abs/2312.10634,,2312.10634.pdf,Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability,"With the advancement of generative models, the assessment of generated images +becomes more and more important. Previous methods measure distances between +features of reference and generated images from trained vision models. In this +paper, we conduct an extensive investigation into the relationship between the +representation space and input space around generated images. We first propose +two measures related to the presence of unnatural elements within images: +complexity, which indicates how non-linear the representation space is, and +vulnerability, which is related to how easily the extracted feature changes by +adversarial input changes. Based on these, we introduce a new metric to +evaluating image-generative models called anomaly score (AS). Moreover, we +propose AS-i (anomaly score for individual images) that can effectively +evaluate generated images individually. Experimental results demonstrate the +validity of the proposed approach.",cs.CV,"['cs.CV', 'cs.LG']" +MuseChat: A Conversational Music Recommendation System for Videos,Zhikang Dong · Bin Chen · Xiulong Liu · Pawel Polak · Peng Zhang, ,https://arxiv.org/abs/2310.06282,,2310.06282.pdf,MuseChat: A Conversational Music Recommendation System for Videos,"Music recommendation for videos attracts growing interest in multi-modal +research. However, existing systems focus primarily on content compatibility, +often ignoring the users' preferences. Their inability to interact with users +for further refinements or to provide explanations leads to a less satisfying +experience. We address these issues with MuseChat, a first-of-its-kind +dialogue-based recommendation system that personalizes music suggestions for +videos. Our system consists of two key functionalities with associated modules: +recommendation and reasoning. The recommendation module takes a video along +with optional information including previous suggested music and user's +preference as inputs and retrieves an appropriate music matching the context. +The reasoning module, equipped with the power of Large Language Model +(Vicuna-7B) and extended to multi-modal inputs, is able to provide reasonable +explanation for the recommended music. 
To evaluate the effectiveness of +MuseChat, we build a large-scale dataset, conversational music recommendation +for videos, that simulates a two-turn interaction between a user and a +recommender based on accurate music track information. Experiment results show +that MuseChat achieves significant improvements over existing video-based music +retrieval methods as well as offers strong interpretability and +interactability.",cs.LG,"['cs.LG', 'cs.CV', 'cs.IR']" +Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration,Lianxin Xie · csbingbing zheng · Wen Xue · Le Jiang · Cheng Liu · Si Wu · Hau San Wong, ,https://arxiv.org/abs/2402.06106,,2402.06106.pdf,CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models,"Recent generative-prior-based methods have shown promising blind face +restoration performance. They usually project the degraded images to the latent +space and then decode high-quality faces either by single-stage latent +optimization or directly from the encoding. Generating fine-grained facial +details faithful to inputs remains a challenging problem. Most existing methods +produce either overly smooth outputs or alter the identity as they attempt to +balance between generation and reconstruction. This may be attributed to the +typical trade-off between quality and resolution in the latent space. If the +latent space is highly compressed, the decoded output is more robust to +degradations but shows worse fidelity. On the other hand, a more flexible +latent space can capture intricate facial details better, but is extremely +difficult to optimize for highly degraded faces using existing techniques. To +address these issues, we introduce a diffusion-based-prior inside a VQGAN +architecture that focuses on learning the distribution over uncorrupted latent +embeddings. With such knowledge, we iteratively recover the clean embedding +conditioning on the degraded counterpart. Furthermore, to ensure the reverse +diffusion trajectory does not deviate from the underlying identity, we train a +separate Identity Recovery Network and use its output to constrain the reverse +diffusion process. Specifically, using a learnable latent mask, we add +gradients from a face-recognition network to a subset of latent features that +correlates with the finer identity-related details in the pixel space, leaving +the other features untouched. Disentanglement between perception and fidelity +in the latent space allows us to achieve the best of both worlds. We perform +extensive evaluations on multiple real and synthetic datasets to validate the +superiority of our approach.",cs.CV,['cs.CV'] +Faces that Speak: Jointly Synthesising Talking Face and Speech from Text,Youngjoon Jang · Jihoon Kim · Junseok Ahn · Doyeop Kwak · Hongsun Yang · Yooncheol Ju · ILHWAN KIM · Byeong-Yeol Kim · Joon Chung,https://mm.kaist.ac.kr/projects/faces-that-speak/,https://arxiv.org/abs/2405.10272,,2405.10272.pdf,Faces that Speak: Jointly Synthesising Talking Face and Speech from Text,"The goal of this work is to simultaneously generate natural talking faces and +speech outputs from text. We achieve this by integrating Talking Face +Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We +address the main challenges of each task: (1) generating a range of head poses +representative of real-world scenarios, and (2) ensuring voice consistency +despite variations in facial motion for the same identity. 
To tackle these +issues, we introduce a motion sampler based on conditional flow matching, which +is capable of high-quality motion code generation in an efficient way. +Moreover, we introduce a novel conditioning method for the TTS system, which +utilises motion-removed features from the TFG model to yield uniform speech +outputs. Our extensive experiments demonstrate that our method effectively +creates natural-looking talking faces and speech that accurately match the +input text. To our knowledge, this is the first effort to build a multimodal +synthesis system that can generalise to unseen identities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.SD', 'eess.AS', 'eess.IV']" +Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance,Yuto Enyo · Ko Nishino, ,https://arxiv.org/abs/2312.04529,,2312.04529.pdf,Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance,"Reflectance bounds the frequency spectrum of illumination in the object +appearance. In this paper, we introduce the first stochastic inverse rendering +method, which recovers the attenuated frequency spectrum of an illumination +jointly with the reflectance of an object of known geometry from a single +image. Our key idea is to solve this blind inverse problem in the reflectance +map, an appearance representation invariant to the underlying geometry, by +learning to reverse the image formation with a novel diffusion model which we +refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed +reflectance map converted and completed from the single input image, DRMNet +generates a reflectance map corresponding to a perfect mirror sphere while +jointly estimating the reflectance. The forward process can be understood as +gradually filtering a natural illumination with lower and lower frequency +reflectance and additive Gaussian noise. DRMNet learns to invert this process +with two subnetworks, IllNet and RefNet, which work in concert towards this +joint estimation. The network is trained on an extensive synthetic dataset and +is demonstrated to generalize to real images, showing state-of-the-art accuracy +on established datasets.",cs.CV,['cs.CV'] +PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild,Kun Yuan · Hongbo Liu · Mading Li · Muyi Sun · Ming Sun · Jiachao Gong · Jinhua Hao · Chao Zhou · Yansong Tang, ,https://arxiv.org/abs/2405.17765,,2405.17765.pdf,PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild,"Video quality assessment (VQA) is a challenging problem due to the numerous +factors that can affect the perceptual quality of a video, \eg, content +attractiveness, distortion type, motion pattern, and level. However, annotating +the Mean opinion score (MOS) for videos is expensive and time-consuming, which +limits the scale of VQA datasets, and poses a significant obstacle for deep +learning-based methods. In this paper, we propose a VQA method named PTM-VQA, +which leverages PreTrained Models to transfer knowledge from models pretrained +on various pre-tasks, enabling benefits for VQA from different aspects. + Specifically, we extract features of videos from different pretrained models +with frozen weights and integrate them to generate representation. 
Since these +models possess various fields of knowledge and are often trained with labels +irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility +(ICID) loss to impose constraints on features extracted by multiple pretrained +models. The intra-consistency constraint ensures that features extracted by +different pretrained models are in the same unified quality-aware latent space, +while the inter-divisibility introduces pseudo clusters based on the annotation +of samples and tries to separate features of samples from different clusters. +Furthermore, with a constantly growing number of pretrained models, it is +crucial to determine which models to use and how to use them. To address this +problem, we propose an efficient scheme to select suitable candidates. Models +with better clustering performance on VQA datasets are chosen to be our +candidates. Extensive experiments demonstrate the effectiveness of the proposed +method.",cs.CV,['cs.CV'] +Plug-and-Play Diffusion Distillation,Yi-Ting Hsiao · Siavash Khodadadeh · Kevin Duarte · Wei-An Lin · Hui Qu · Mingi Kwon · Ratheesh Kalarot,https://5410tiffany.github.io/plug-and-play-diffusion-distillation.github.io/,https://arxiv.org/abs/2403.12015,,2403.12015.pdf,Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation,"Diffusion models are the main driver of progress in image and video +synthesis, but suffer from slow inference speed. Distillation methods, like the +recently introduced adversarial diffusion distillation (ADD) aim to shift the +model from many-shot to single-step inference, albeit at the cost of expensive +and difficult optimization due to its reliance on a fixed pretrained DINOv2 +discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a +novel distillation approach overcoming the limitations of ADD. In contrast to +pixel-based ADD, LADD utilizes generative features from pretrained latent +diffusion models. This approach simplifies training and enhances performance, +enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to +Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the +performance of state-of-the-art text-to-image generators using only four +unguided sampling steps. Moreover, we systematically investigate its scaling +behavior and demonstrate LADD's effectiveness in various applications such as +image editing and inpainting.",cs.CV,['cs.CV'] +Masked and Shuffled Blind Spot Denoising for Real-World Images,Hamadi Chihaoui · Paolo Favaro, ,https://arxiv.org/abs/2404.09389,,2404.09389.pdf,Masked and Shuffled Blind Spot Denoising for Real-World Images,"We introduce a novel approach to single image denoising based on the Blind +Spot Denoising principle, which we call MAsked and SHuffled Blind Spot +Denoising (MASH). We focus on the case of correlated noise, which often plagues +real images. MASH is the result of a careful analysis to determine the +relationships between the level of blindness (masking) of the input and the +(unknown) noise correlation. Moreover, we introduce a shuffling technique to +weaken the local correlation of noise, which in turn yields an additional +denoising performance improvement. We evaluate MASH via extensive experiments +on real-world noisy image datasets. 
We demonstrate on par or better results +compared to existing self-supervised denoising methods.",cs.CV,"['cs.CV', 'cs.LG']" +Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation,Li Hu, ,https://arxiv.org/abs/2311.17117,,2311.17117.pdf,Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation,"Character Animation aims to generating character videos from still images +through driving signals. Currently, diffusion models have become the mainstream +in visual generation research, owing to their robust generative capabilities. +However, challenges persist in the realm of image-to-video, especially in +character animation, where temporally maintaining consistency with detailed +information from character remains a formidable problem. In this paper, we +leverage the power of diffusion models and propose a novel framework tailored +for character animation. To preserve consistency of intricate appearance +features from reference image, we design ReferenceNet to merge detail features +via spatial attention. To ensure controllability and continuity, we introduce +an efficient pose guider to direct character's movements and employ an +effective temporal modeling approach to ensure smooth inter-frame transitions +between video frames. By expanding the training data, our approach can animate +arbitrary characters, yielding superior results in character animation compared +to other image-to-video methods. Furthermore, we evaluate our method on +benchmarks for fashion video and human dance synthesis, achieving +state-of-the-art results.",cs.CV,['cs.CV'] +Bootstrapping SparseFormers from Vision Foundation Models,Ziteng Gao · Zhan Tong · Kevin Qinghong Lin · Joya Chen · Mike Zheng Shou,https://github.com/showlab/sparseformer,https://arxiv.org/abs/2312.01987,,2312.01987.pdf,Bootstrapping SparseFormers from Vision Foundation Models,"The recently proposed SparseFormer architecture provides an alternative +approach to visual understanding by utilizing a significantly lower number of +visual tokens via adjusting RoIs, greatly reducing computational costs while +still achieving promising performance. However, training SparseFormers from +scratch is still expensive, and scaling up the number of parameters can be +challenging. In this paper, we propose to bootstrap SparseFormers from +ViT-based vision foundation models in a simple and efficient way. Since the +majority of SparseFormer blocks are the standard transformer ones, we can +inherit weights from large-scale pre-trained vision transformers and freeze +them as much as possible. Therefore, we only need to train the +SparseFormer-specific lightweight focusing transformer to adjust token RoIs and +fine-tune a few early pre-trained blocks to align the final token +representation. In such a way, we can bootstrap SparseFormer architectures from +various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or +CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and +without labels or captions within just a few hours. As a result, the +bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% +accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from +CLIPs also demonstrates notable zero-shot performance with highly reduced +computational cost without seeing any caption during the bootstrapping +procedure. 
In addition, CLIP-bootstrapped SparseFormers, which align the output +space with language without seeing a word, can serve as efficient vision +encoders in multimodal large language models. Code and models are available at +https://github.com/showlab/sparseformer",cs.CV,['cs.CV'] +Self-Supervised Dual Contouring,Ramana Sundararaman · Roman Klokov · Maks Ovsjanikov, ,https://arxiv.org/abs/2405.18131,,2405.18131.pdf,Self-Supervised Dual Contouring,"Learning-based isosurface extraction methods have recently emerged as a +robust and efficient alternative to axiomatic techniques. However, the vast +majority of such approaches rely on supervised training with axiomatically +computed ground truths, thus potentially inheriting biases and data artifacts +of the corresponding axiomatic methods. Steering away from such dependencies, +we propose a self-supervised training scheme for the Neural Dual Contouring +meshing framework, resulting in our method: Self-Supervised Dual Contouring +(SDC). Instead of optimizing predicted mesh vertices with supervised training, +we use two novel self-supervised loss functions that encourage the consistency +between distances to the generated mesh up to the first order. Meshes +reconstructed by SDC surpass existing data-driven methods in capturing +intricate details while being more robust to possible irregularities in the +input. Furthermore, we use the same self-supervised training objective linking +inferred mesh and input SDF, to regularize the training process of Deep +Implicit Networks (DINs). We demonstrate that the resulting DINs produce +higher-quality implicit functions, ultimately leading to more accurate and +detail-preserving surfaces compared to prior baselines for different input +modalities. Finally, we demonstrate that our self-supervised losses improve +meshing performance in the single-view reconstruction task by enabling joint +training of predicted SDF and resulting output mesh. We open-source our code at +https://github.com/Sentient07/SDC",cs.CV,['cs.CV'] +Wired Perspectives: Multi-View Wire Art Embraces Generative AI,Zhiyu Qu · LAN YANG · Honggang Zhang · Tao Xiang · Kaiyue Pang · Yi-Zhe Song,https://dreamwireart.github.io/,https://arxiv.org/abs/2311.15421,,,Wired Perspectives: Multi-View Wire Art Embraces Generative AI,"Creating multi-view wire art (MVWA), a static 3D sculpture with diverse +interpretations from different viewpoints, is a complex task even for skilled +artists. In response, we present DreamWire, an AI system enabling everyone to +craft MVWA easily. Users express their vision through text prompts or +scribbles, freeing them from intricate 3D wire organisation. Our approach +synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation +from diffusion models or their variants (e.g., ControlNet). This blend enables +the system to represent 3D wire art, ensuring spatial continuity and overcoming +data scarcity. 
Extensive evaluation and analysis are conducted to shed insight +on the inner workings of the proposed system, including the trade-off between +connectivity and visual aesthetics.",cs.CV,"['cs.CV', 'cs.AI']" +SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation,Yuxuan Zhang · Yiren Song · Jiaming Liu · Rui Wang · Jinpeng Yu · Hao Tang · Huaxia Li · Xu Tang · Yao Hu · Han Pan · Zhongliang Jing,https://ssr-encoder.github.io/,https://arxiv.org/abs/2312.16272,,2312.16272.pdf,SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation,"Recent advancements in subject-driven image generation have led to zero-shot +generation, yet precise selection and focus on crucial subject representations +remain challenging. Addressing this, we introduce the SSR-Encoder, a novel +architecture designed for selectively capturing any subject from single or +multiple reference images. It responds to various query modalities including +text and masks, without necessitating test-time fine-tuning. The SSR-Encoder +combines a Token-to-Patch Aligner that aligns query inputs with image patches +and a Detail-Preserving Subject Encoder for extracting and preserving fine +features of the subjects, thereby generating subject embeddings. These +embeddings, used in conjunction with original text embeddings, condition the +generation process. Characterized by its model generalizability and efficiency, +the SSR-Encoder adapts to a range of custom models and control modules. +Enhanced by the Embedding Consistency Regularization Loss for improved +training, our extensive experiments demonstrate its effectiveness in versatile +and high-quality image generation, indicating its broad applicability. Project +page: https://ssr-encoder.github.io",cs.CV,['cs.CV'] +Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval,Rohan Sarkar · Avinash Kak, ,https://arxiv.org/abs/2403.00272,,2403.00272.pdf,Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval,"In the context of pose-invariant object recognition and retrieval, we +demonstrate that it is possible to achieve significant improvements in +performance if both the category-based and the object-identity-based embeddings +are learned simultaneously during training. In hindsight, that sounds intuitive +because learning about the categories is more fundamental than learning about +the individual objects that correspond to those categories. However, to the +best of what we know, no prior work in pose-invariant learning has demonstrated +this effect. This paper presents an attention-based dual-encoder architecture +with specially designed loss functions that optimize the inter- and intra-class +distances simultaneously in two different embedding spaces, one for the +category embeddings and the other for the object-level embeddings. The loss +functions we have proposed are pose-invariant ranking losses that are designed +to minimize the intra-class distances and maximize the inter-class distances in +the dual representation spaces. We demonstrate the power of our approach with +three challenging multi-view datasets, ModelNet-40, ObjectPI, and FG3D. With +our dual approach, for single-view object recognition, we outperform the +previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. 
On +the other hand, for single-view object retrieval, we outperform the previous +best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.",cs.CV,"['cs.CV', 'cs.IR', 'cs.LG']" +Symphonize 3D Semantic Scene Completion with Contextual Instance Queries,Haoyi Jiang · Tianheng Cheng · Naiyu Gao · Haoyang Zhang · Tianwei Lin · Wenyu Liu · Xinggang Wang, ,https://arxiv.org/abs/2306.15670v2,,2306.15670v2.pdf,Symphonize 3D Semantic Scene Completion with Contextual Instance Queries,"`3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal +undertaking in autonomous driving, aiming to predict voxel occupancy within +volumetric scenes. However, prevailing methodologies primarily focus on +voxel-wise feature aggregation, while neglecting instance semantics and scene +context. In this paper, we present a novel paradigm termed Symphonies +(Scene-from-Insts), that delves into the integration of instance queries to +orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our +proposed Serial Instance-Propagated Attentions, Symphonies dynamically encodes +instance-centric semantics, facilitating intricate interactions between +image-based and volumetric domains. Simultaneously, Symphonies enables holistic +scene comprehension by capturing context through the efficient fusion of +instance queries, alleviating geometric ambiguity such as occlusion and +perspective errors through contextual scene reasoning. Experimental results +demonstrate that Symphonies achieves state-of-the-art performance on +challenging benchmarks SemanticKITTI and SSCBench-KITTI-360, yielding +remarkable mIoU scores of 15.04 and 18.58, respectively. These results showcase +the paradigm's promising advancements. The code is available at +https://github.com/hustvl/Symphonies.",cs.CV,"['cs.CV', 'cs.RO']" +KeyPoint Relative Position Encoding for Face Recognition,Minchul Kim · Feng Liu · Yiyang Su · Anil Jain · Xiaoming Liu, ,https://arxiv.org/abs/2403.14852,,2403.14852.pdf,KeyPoint Relative Position Encoding for Face Recognition,"In this paper, we address the challenge of making ViT models more robust to +unseen affine transformations. Such robustness becomes useful in various +recognition tasks such as face recognition when image alignment failures occur. +We propose a novel method called KP-RPE, which leverages key points +(e.g.~facial landmarks) to make ViT more resilient to scale, translation, and +pose variations. We begin with the observation that Relative Position Encoding +(RPE) is a good way to bring affine transform generalization to ViTs. RPE, +however, can only inject the model with prior knowledge that nearby pixels are +more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this +principle, where the significance of pixels is not solely dictated by their +proximity but also by their relative positions to specific keypoints within the +image. By anchoring the significance of pixels around keypoints, the model can +more effectively retain spatial relationships, even when those relationships +are disrupted by affine transformations. We show the merit of KP-RPE in face +and gait recognition. The experimental results demonstrate the effectiveness in +improving face recognition performance from low-quality images, particularly +where alignment is prone to failure. 
Code and pre-trained models are available.",cs.CV,['cs.CV'] +Feedback-Guided Autonomous Driving,Jimuyang Zhang · Zanming Huang · Arijit Ray · Eshed Ohn-Bar, ,https://arxiv.org/abs/2306.10014,,2306.10014.pdf,Coaching a Teachable Student,"We propose a novel knowledge distillation framework for effectively teaching +a sensorimotor student agent to drive from the supervision of a privileged +teacher agent. Current distillation for sensorimotor agents methods tend to +result in suboptimal learned driving behavior by the student, which we +hypothesize is due to inherent differences between the input, modeling +capacity, and optimization processes of the two agents. We develop a novel +distillation scheme that can address these limitations and close the gap +between the sensorimotor agent and its privileged teacher. Our key insight is +to design a student which learns to align their input features with the +teacher's privileged Bird's Eye View (BEV) space. The student then can benefit +from direct supervision by the teacher over the internal representation +learning. To scaffold the difficult sensorimotor learning task, the student +model is optimized via a student-paced coaching mechanism with various +auxiliary supervision. We further propose a high-capacity imitation learned +privileged agent that surpasses prior privileged agents in CARLA and ensures +the student learns safe driving behavior. Our proposed sensorimotor agent +results in a robust image-based behavior cloning agent in CARLA, improving over +current models by over 20.6% in driving score without requiring LiDAR, +historical observations, ensemble of models, on-policy data aggregation or +reinforcement learning.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Look-Up Table Compression for Efficient Image Restoration,Yinglong Li · Jiacheng Li · Zhiwei Xiong, ,https://arxiv.org/abs/2307.08544,,2307.08544.pdf,Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution,"Look-up table(LUT)-based methods have shown the great efficacy in single +image super-resolution (SR) task. However, previous methods ignore the +essential reason of restricted receptive field (RF) size in LUT, which is +caused by the interaction of space and channel features in vanilla convolution. +They can only increase the RF at the cost of linearly increasing LUT size. To +enlarge RF with contained LUT sizes, we propose a novel Reconstructed +Convolution(RC) module, which decouples channel-wise and spatial calculation. +It can be formulated as $n^2$ 1D LUTs to maintain $n\times n$ receptive field, +which is obviously smaller than $n\times n$D LUT formulated before. The LUT +generated by our RC module reaches less than 1/10000 storage compared with +SR-LUT baseline. The proposed Reconstructed Convolution module based LUT +method, termed as RCLUT, can enlarge the RF size by 9 times than the +state-of-the-art LUT-based SR method and achieve superior performance on five +popular benchmark dataset. Moreover, the efficient and robust RC module can be +used as a plugin to improve other LUT-based SR methods. The code is available +at https://github.com/liuguandu/RC-LUT.",eess.IV,"['eess.IV', 'cs.CV']" +WaveMo: Learning Wavefront Modulations to See Through Scattering,Mingyang Xie · Haiyun Guo · Brandon Y. 
Feng · Lingbo Jin · Ashok Veeraraghavan · Christopher Metzler,https://wavemo-2024.github.io/,https://arxiv.org/abs/2404.07985v1,,2404.07985v1.pdf,WaveMo: Learning Wavefront Modulations to See Through Scattering,"Imaging through scattering media is a fundamental and pervasive challenge in +fields ranging from medical diagnostics to astronomy. A promising strategy to +overcome this challenge is wavefront modulation, which induces measurement +diversity during image acquisition. Despite its importance, designing optimal +wavefront modulations to image through scattering remains under-explored. This +paper introduces a novel learning-based framework to address the gap. Our +approach jointly optimizes wavefront modulations and a computationally +lightweight feedforward ""proxy"" reconstruction network. This network is trained +to recover scenes obscured by scattering, using measurements that are modified +by these modulations. The learned modulations produced by our framework +generalize effectively to unseen scattering scenarios and exhibit remarkable +versatility. During deployment, the learned modulations can be decoupled from +the proxy network to augment other more computationally expensive restoration +algorithms. Through extensive experiments, we demonstrate our approach +significantly advances the state of the art in imaging through scattering +media. Our project webpage is at https://wavemo-2024.github.io/.",cs.CV,"['cs.CV', 'eess.IV']" +Constrained Layout Generation with Factor Graphs,Mohammed Haroon Dupty · Yanfei Dong · Sicong Leng · Guoji Fu · Yong Liang Goh · Wei Lu · Wee Sun Lee, ,https://arxiv.org/abs/2404.00385,,2404.00385.pdf,Constrained Layout Generation with Factor Graphs,"This paper addresses the challenge of object-centric layout generation under +spatial constraints, seen in multiple domains including floorplan design +process. The design process typically involves specifying a set of spatial +constraints that include object attributes like size and inter-object relations +such as relative positioning. Existing works, which typically represent objects +as single nodes, lack the granularity to accurately model complex interactions +between objects. For instance, often only certain parts of an object, like a +room's right wall, interact with adjacent objects. To address this gap, we +introduce a factor graph based approach with four latent variable nodes for +each room, and a factor node for each constraint. The factor nodes represent +dependencies among the variables to which they are connected, effectively +capturing constraints that are potentially of a higher order. We then develop +message-passing on the bipartite graph, forming a factor graph neural network +that is trained to produce a floorplan that aligns with the desired +requirements. Our approach is simple and generates layouts faithful to the user +requirements, demonstrated by a large improvement in IOU scores over existing +methods. Additionally, our approach, being inferential and accurate, is +well-suited to the practical human-in-the-loop design process where +specifications evolve iteratively, offering a practical and powerful tool for +AI-guided design.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Prompt Learning via Meta-Regularization,Jinyoung Park · Juyeon Ko · Hyunwoo J. Kim, ,https://arxiv.org/abs/2404.00851,,2404.00851.pdf,Prompt Learning via Meta-Regularization,"Pre-trained vision-language models have shown impressive success on various +computer vision tasks with their zero-shot generalizability. 
Recently, prompt +learning approaches have been explored to efficiently and effectively adapt the +vision-language models to a variety of downstream tasks. However, most existing +prompt learning methods suffer from task overfitting since the general +knowledge of the pre-trained vision language models is forgotten while the +prompts are finetuned on a small data set from a specific target task. To +address this issue, we propose a Prompt Meta-Regularization (ProMetaR) to +improve the generalizability of prompt learning for vision-language models. +Specifically, ProMetaR meta-learns both the regularizer and the soft prompts to +harness the task-specific knowledge from the downstream tasks and task-agnostic +general knowledge from the vision-language models. Further, ProMetaR augments +the task to generate multiple virtual tasks to alleviate the meta-overfitting. +In addition, we provide the analysis to comprehend how ProMetaR improves the +generalizability of prompt tuning in the perspective of the gradient alignment. +Our extensive experiments demonstrate that our ProMetaR improves the +generalizability of conventional prompt learning methods under +base-to-base/base-to-new and domain generalization settings. The code of +ProMetaR is available at https://github.com/mlvlab/ProMetaR.",cs.CV,['cs.CV'] +Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation,Xin Fan · Xiaolin Wang · Jiaxin Gao · Jia Wang · Zhongxuan Luo · Risheng Liu, ,,https://dl.acm.org/doi/10.1145/3580305.3599452,,,,,nan +NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis,Zinuo You · Andreas Geiger · Anpei Chen,https://sinoyou.github.io/nelf-pro/,https://arxiv.org/abs/2312.13328,,2312.13328.pdf,NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis,"We present NeLF-Pro, a novel representation to model and reconstruct light +fields in diverse natural scenes that vary in extent and spatial granularity. +In contrast to previous fast reconstruction methods that represent the 3D scene +globally, we model the light field of a scene as a set of local light field +feature probes, parameterized with position and multi-channel 2D feature maps. +Our central idea is to bake the scene's light field into spatially varying +learnable representations and to query point features by weighted blending of +probes close to the camera - allowing for mipmap representation and rendering. +We introduce a novel vector-matrix-matrix (VMM) factorization technique that +effectively represents the light field feature probes as products of core +factors (i.e., VM) shared among local feature probes, and a basis factor (i.e., +M) - efficiently encoding internal relationships and patterns within the scene. +Experimentally, we demonstrate that NeLF-Pro significantly boosts the +performance of feature grid-based representations, and achieves fast +reconstruction with better rendering quality while maintaining compact +modeling. 
Project webpage https://sinoyou.github.io/nelf-pro/.",cs.CV,['cs.CV'] +ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers,Narges Norouzi · Svetlana Orlova · Daan de Geus · Gijs Dubbelman,https://www.tue-mps.org/ALGM/,https://arxiv.org/abs/2405.14467,,2405.14467.pdf,Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation,"Utilizing transformer architectures for semantic segmentation of +high-resolution images is hindered by the attention's quadratic computational +complexity in the number of tokens. A solution to this challenge involves +decreasing the number of tokens through token merging, which has exhibited +remarkable enhancements in inference speed, training efficiency, and memory +utilization for image classification tasks. In this paper, we explore various +token merging strategies within the framework of the Segformer architecture and +perform experiments on multiple semantic segmentation and human pose estimation +datasets. Notably, without model re-training, we, for example, achieve an +inference acceleration of 61% on the Cityscapes dataset while maintaining the +mIoU performance. Consequently, this paper facilitates the deployment of +transformer-based architectures on resource-constrained devices and in +real-time applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance,Junkai Fan · Jiangwei Weng · Kun Wang · Yijun Yang · Jianjun Qian · Jun Li · Jian Yang,https://fanjunkai1.github.io/projectpage/DVD/index.html,https://arxiv.org/abs/2405.09996,,2405.09996.pdf,Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance,"Real driving-video dehazing poses a significant challenge due to the inherent +difficulty in acquiring precisely aligned hazy/clear video pairs for effective +model training, especially in dynamic driving scenarios with unpredictable +weather conditions. In this paper, we propose a pioneering approach that +addresses this challenge through a nonaligned regularization strategy. Our core +concept involves identifying clear frames that closely match hazy frames, +serving as references to supervise a video dehazing network. Our approach +comprises two key components: reference matching and video dehazing. Firstly, +we introduce a non-aligned reference frame matching module, leveraging an +adaptive sliding window to match high-quality reference frames from clear +videos. Video dehazing incorporates flow-guided cosine attention sampler and +deformable cosine attention fusion modules to enhance spatial multiframe +alignment and fuse their improved information. To validate our approach, we +collect a GoProHazy dataset captured effortlessly with GoPro cameras in diverse +rural and urban road environments. Extensive experiments demonstrate the +superiority of the proposed method over current state-of-the-art methods in the +challenging task of real driving-video dehazing. Project page.",cs.CV,['cs.CV'] +Koala: Key frame-conditioned long video-LLM,Reuben Tan · Ximeng Sun · Ping Hu · Jui-Hsien Wang · Hanieh Deilamsalehy · Bryan A. Plummer · Bryan Russell · Kate Saenko, ,https://arxiv.org/abs/2404.04346,,2404.04346.pdf,Koala: Key frame-conditioned long video-LLM,"Long video question answering is a challenging task that involves recognizing +short-term activities and reasoning about their fine-grained relationships. 
+State-of-the-art video Large Language Models (vLLMs) hold promise as a viable +solution due to their demonstrated emergent capabilities on new tasks. However, +despite being trained on millions of short seconds-long videos, vLLMs are +unable to understand minutes-long videos and accurately answer questions about +them. To address this limitation, we propose a lightweight and self-supervised +approach, Key frame-conditioned long video-LLM (Koala), that introduces +learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to +longer videos. Our approach introduces two new tokenizers that condition on +visual tokens computed from sparse video key frames for understanding short and +long video moments. We train our proposed approach on HowTo100M and demonstrate +its effectiveness on zero-shot long video understanding benchmarks, where it +outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across +all tasks. Surprisingly, we also empirically show that our approach not only +helps a pretrained vLLM to understand long videos but also improves its +accuracy on short-term action recognition.",cs.CV,['cs.CV'] +Hyperspherical Classification with Dynamic Label-to-Prototype Assignment,Mohammad Saadabadi Saadabadi · Ali Dabouei · Sahar Rahimi Malakshan · Nasser Nasrabadi, ,https://arxiv.org/abs/2403.16937,,2403.16937.pdf,Hyperspherical Classification with Dynamic Label-to-Prototype Assignment,"Aiming to enhance the utilization of metric space by the parametric softmax +classifier, recent studies suggest replacing it with a non-parametric +alternative. Although a non-parametric classifier may provide better metric +space utilization, it introduces the challenge of capturing inter-class +relationships. A shared characteristic among prior non-parametric classifiers +is the static assignment of labels to prototypes during the training, ie, each +prototype consistently represents a class throughout the training course. +Orthogonal to previous works, we present a simple yet effective method to +optimize the category assigned to each prototype (label-to-prototype +assignment) during the training. To this aim, we formalize the problem as a +two-step optimization objective over network parameters and label-to-prototype +assignment mapping. We solve this optimization using a sequential combination +of gradient descent and Bipartide matching. We demonstrate the benefits of the +proposed approach by conducting experiments on balanced and long-tail +classification problems using different backbone network architectures. In +particular, our method outperforms its competitors by 1.22\% accuracy on +CIFAR-100, and 2.15\% on ImageNet-200 using a metric space dimension half of +the size of its competitors. Code: +https://github.com/msed-Ebrahimi/DL2PA_CVPR24",cs.CV,['cs.CV'] +From Activation to Initialization: Scaling Insights for Optimizing Neural Fields,Hemanth Saratchandran · Sameera Ramasinghe · Simon Lucey, ,https://arxiv.org/abs/2403.19205,,2403.19205.pdf,From Activation to Initialization: Scaling Insights for Optimizing Neural Fields,"In the realm of computer vision, Neural Fields have gained prominence as a +contemporary tool harnessing neural networks for signal representation. Despite +the remarkable progress in adapting these networks to solve a variety of +problems, the field still lacks a comprehensive theoretical framework. 
This +article aims to address this gap by delving into the intricate interplay +between initialization and activation, providing a foundational basis for the +robust optimization of Neural Fields. Our theoretical insights reveal a +deep-seated connection among network initialization, architectural choices, and +the optimization process, emphasizing the need for a holistic approach when +designing cutting-edge Neural Fields.",cs.CV,"['cs.CV', 'cs.LG']" +Tune-An-Ellipse: CLIP Has Potential to Find What You Want,Jinheng Xie · Songhe Deng · Bing Li · Haozhe Liu · Yawen Huang · Yefeng Zheng · Jürgen Schmidhuber · Bernard Ghanem · Linlin Shen · Mike Zheng Shou, ,,https://cloud.tencent.com/developer/article/2396040,,,,,nan +Neural Lineage,Runpeng Yu · Xinchao Wang, ,https://arxiv.org/abs/2312.02470v1,,2312.02470v1.pdf,Generator Born from Classifier,"In this paper, we make a bold attempt toward an ambitious task: given a +pre-trained classifier, we aim to reconstruct an image generator, without +relying on any data samples. From a black-box perspective, this challenge seems +intractable, since it inevitably involves identifying the inverse function for +a classifier, which is, by nature, an information extraction process. As such, +we resort to leveraging the knowledge encapsulated within the parameters of the +neural network. Grounded on the theory of Maximum-Margin Bias of gradient +descent, we propose a novel learning paradigm, in which the generator is +trained to ensure that the convergence conditions of the network parameters are +satisfied over the generated distribution of the samples. Empirical validation +from various image generation tasks substantiates the efficacy of our strategy.",cs.LG,"['cs.LG', 'cs.CV']" +Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels,Tianming Liang · Chaolei Tan · Beihao Xia · Wei-Shi Zheng · Jian-Fang Hu, ,https://arxiv.org/abs/2403.14430,,2403.14430.pdf,Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels,"This paper focuses on open-ended video question answering, which aims to find +the correct answers from a large answer set in response to a video-related +question. This is essentially a multi-label classification task, since a +question may have multiple answers. However, due to annotation costs, the +labels in existing benchmarks are always extremely insufficient, typically one +answer per question. As a result, existing works tend to directly treat all the +unlabeled answers as negative labels, leading to limited ability for +generalization. In this work, we introduce a simple yet effective ranking +distillation framework (RADI) to mitigate this problem without additional +manual annotation. RADI employs a teacher model trained with incomplete labels +to generate rankings for potential answers, which contain rich knowledge about +label priority as well as label-associated visual cues, thereby enriching the +insufficient labeling information. To avoid overconfidence in the imperfect +teacher model, we further present two robust and parameter-free ranking +distillation approaches: a pairwise approach which introduces adaptive soft +margins to dynamically refine the optimization constraints on various pairwise +rankings, and a listwise approach which adopts sampling-based partial listwise +learning to resist the bias in teacher ranking. Extensive experiments on five +popular benchmarks consistently show that both our pairwise and listwise RADIs +outperform state-of-the-art methods. 
Further analysis demonstrates the +effectiveness of our methods on the insufficient labeling problem.",cs.CV,['cs.CV'] +Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation,Jingyun Wang · Guoliang Kang, ,https://arxiv.org/abs/2403.04547,,2403.04547.pdf,CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?,"We study the effectiveness of data-balancing for mitigating biases in +contrastive language-image pretraining (CLIP), identifying areas of strength +and limitation. First, we reaffirm prior conclusions that CLIP models can +inadvertently absorb societal stereotypes. To counter this, we present a novel +algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both +representation and association biases (i.e. in first- and second-order +statistics) in multimodal data. We use M4 to conduct an in-depth analysis +taking into account various factors, such as the model, representation, and +data size. Our study also explores the dynamic nature of how CLIP learns and +unlearns biases. In particular, we find that fine-tuning is effective in +countering representation biases, though its impact diminishes for association +biases. Also, data balancing has a mixed impact on quality: it tends to improve +classification but can hurt retrieval. Interestingly, data and architectural +improvements seem to mitigate the negative impact of data balancing on +performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves +COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and +ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with +recommendations for improving the efficacy of data balancing in multimodal +systems.",cs.LG,"['cs.LG', 'cs.AI']" +FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions,Zhen Liu · Hao Zhu · Qi Zhang · Jingde Fu · Weibing Deng · Zhan Ma · Yanwen Guo · Xun Cao, ,https://arxiv.org/abs/2312.02434,,2312.02434.pdf,FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions,"Implicit Neural Representation (INR), which utilizes a neural network to map +coordinate inputs to corresponding attributes, is causing a revolution in the +field of signal processing. However, current INR techniques suffer from a +restricted capability to tune their supported frequency set, resulting in +imperfect performance when representing complex signals with multiple +frequencies. We have identified that this frequency-related problem can be +greatly alleviated by introducing variable-periodic activation functions, for +which we propose FINER. By initializing the bias of the neural network within +different ranges, sub-functions with various frequencies in the +variable-periodic function are selected for activation. Consequently, the +supported frequency set of FINER can be flexibly tuned, leading to improved +performance in signal representation. 
We demonstrate the capabilities of FINER +in the contexts of 2D image fitting, 3D signed distance field representation, +and 5D neural radiance fields optimization, and we show that it outperforms +existing INRs.",cs.CV,['cs.CV'] +InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning,Yan-Shuo Liang · Wu-Jun Li,https://github.com/liangyanshuo/InfLoRA,https://arxiv.org/abs/2404.00228,,2404.00228.pdf,InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning,"Continual learning requires the model to learn multiple tasks sequentially. +In continual learning, the model should possess the ability to maintain its +performance on old tasks (stability) and the ability to adapt to new tasks +continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), +which involves freezing a pre-trained model and injecting a small number of +learnable parameters to adapt to downstream tasks, has gained increasing +popularity in continual learning. Although existing continual learning methods +based on PEFT have demonstrated superior performance compared to those not +based on PEFT, most of them do not consider how to eliminate the interference +of the new task on the old tasks, which inhibits the model from making a good +trade-off between stability and plasticity. In this work, we propose a new PEFT +method, called interference-free low-rank adaptation (InfLoRA), for continual +learning. InfLoRA injects a small number of parameters to reparameterize the +pre-trained weights and shows that fine-tuning these injected parameters is +equivalent to fine-tuning the pre-trained weights within a subspace. +Furthermore, InfLoRA designs this subspace to eliminate the interference of the +new task on the old tasks, making a good trade-off between stability and +plasticity. Experimental results show that InfLoRA outperforms existing +state-of-the-art continual learning methods on multiple datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation,Yuanchen Wu · Xichen Ye · KequanYang · Jide Li · Xiaoqiang Li, ,https://arxiv.org/abs/2403.11184,,2403.11184.pdf,DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation,"Recently, One-stage Weakly Supervised Semantic Segmentation (WSSS) with +image-level labels has gained increasing interest due to simplification over +its cumbersome multi-stage counterpart. Limited by the inherent ambiguity of +Class Activation Map (CAM), we observe that one-stage pipelines often encounter +confirmation bias caused by incorrect CAM pseudo-labels, impairing their final +segmentation performance. Although recent works discard many unreliable +pseudo-labels to implicitly alleviate this issue, they fail to exploit +sufficient supervision for their models. To this end, we propose a dual student +framework with trustworthy progressive learning (DuPL). Specifically, we +propose a dual student network with a discrepancy loss to yield diverse CAMs +for each sub-net. The two sub-nets generate supervision for each other, +mitigating the confirmation bias caused by learning their own incorrect +pseudo-labels. In this process, we progressively introduce more trustworthy +pseudo-labels to be involved in the supervision through dynamic threshold +adjustment with an adaptive noise filtering strategy. Moreover, we believe that +every pixel, even discarded from supervision due to its unreliability, is +important for WSSS. 
Thus, we develop consistency regularization on these +discarded regions, providing supervision of every pixel. Experiment results +demonstrate the superiority of the proposed DuPL over the recent +state-of-the-art alternatives on PASCAL VOC 2012 and MS COCO datasets. Code is +available at https://github.com/Wu0409/DuPL.",cs.CV,['cs.CV'] +Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding,Chaolei Tan · Jianhuang Lai · Wei-Shi Zheng · Jian-Fang Hu, ,https://arxiv.org/abs/2403.11463v2,,2403.11463v2.pdf,Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding,"Video Paragraph Grounding (VPG) is an emerging task in video-language +understanding, which aims at localizing multiple sentences with semantic +relations and temporal order from an untrimmed video. However, existing VPG +approaches are heavily reliant on a considerable number of temporal labels that +are laborious and time-consuming to acquire. In this work, we introduce and +explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the +need of temporal annotations. Different from previous weakly-supervised +grounding frameworks based on multiple instance learning or reconstruction +learning for two-stage candidate ranking, we propose a novel siamese learning +framework that jointly learns the cross-modal feature alignment and temporal +coordinate regression without timestamp labels to achieve concise one-stage +localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer +(SiamGTR) consisting of two weight-sharing branches for learning complementary +supervision. An Augmentation Branch is utilized for directly regressing the +temporal boundaries of a complete paragraph within a pseudo video, and an +Inference Branch is designed to capture the order-guided feature correspondence +for localizing multiple sentences in a normal video. We demonstrate by +extensive experiments that our paradigm has superior practicability and +flexibility to achieve efficient weakly-supervised or semi-supervised learning, +outperforming state-of-the-art methods trained with the same or stronger +supervision.",cs.CV,['cs.CV'] +Prompt Augmentation for Self-supervised Text-guided Image Manipulation,Rumeysa Bodur · Binod Bhattarai · Tae-Kyun Kim, ,https://arxiv.org/html/2403.10255v1,,2403.10255v1.pdf,Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,"Super-resolution (SR) and image generation are important tasks in computer +vision and are widely adopted in real-world applications. Most existing +methods, however, generate images only at fixed-scale magnification and suffer +from over-smoothing and artifacts. Additionally, they do not offer enough +diversity of output images nor image consistency at different scales. Most +relevant work applied Implicit Neural Representation (INR) to the denoising +diffusion model to obtain continuous-resolution yet diverse and high-quality SR +results. Since this model operates in the image space, the larger the +resolution of image is produced, the more memory and inference time is +required, and it also does not maintain scale-specific consistency. We propose +a novel pipeline that can super-resolve an input image or generate from a +random noise a novel image at arbitrary scales. The method consists of a +pretrained auto-encoder, a latent diffusion model, and an implicit neural +decoder, and their learning strategies. 
The proposed method adopts diffusion +processes in a latent space, thus efficient, yet aligned with output image +space decoded by MLPs at arbitrary scales. More specifically, our +arbitrary-scale decoder is designed by the symmetric decoder w/o up-scaling +from the pretrained auto-encoder, and Local Implicit Image Function (LIIF) in +series. The latent diffusion process is learnt by the denoising and the +alignment losses jointly. Errors in output images are backpropagated via the +fixed decoder, improving the quality of output images. In the extensive +experiments using multiple public benchmarks on the two tasks i.e. image +super-resolution and novel image generation at arbitrary scales, the proposed +method outperforms relevant methods in metrics of image quality, diversity and +scale consistency. It is significantly better than the relevant prior-art in +the inference speed and memory usage.",cs.CV,['cs.CV'] +"Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation",ZHIXIANG WEI · Lin Chen · Xiaoxiao Ma · Huaian Chen · Tianle Liu · Pengyang Ling · Jinjin Zheng · Ben Wang · Yi Jin,https://zxwei.site/rein/,https://arxiv.org/abs/2312.04265,,2312.04265.pdf,"Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation","In this paper, we first assess and harness various Vision Foundation Models +(VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). +Driven by the motivation that Leveraging Stronger pre-trained models and Fewer +trainable parameters for Superior generalizability, we introduce a robust +fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for +DGSS. Built upon a set of trainable tokens, each linked to distinct instances, +Rein precisely refines and forwards the feature maps from each layer to the +next layer within the backbone. This process produces diverse refinements for +different categories within a single image. With fewer trainable parameters, +Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full +parameter fine-tuning. Extensive experiments across various settings +demonstrate that Rein significantly outperforms state-of-the-art methods. +Remarkably, with just an extra 1% of trainable parameters within the frozen +backbone, Rein achieves a mIoU of 78.4% on the Cityscapes, without accessing +any real urban-scene datasets.Code is available at +https://github.com/w1oves/Rein.git.",cs.CV,['cs.CV'] +PoseGPT: Chatting about 3D Human Pose,Yao Feng · Jing Lin · Sai Kumar Dwivedi · Yu Sun · Priyanka Patel · Michael J. Black,https://yfeng95.github.io/ChatPose/,https://arxiv.org/abs/2311.18836,,2311.18836.pdf,ChatPose: Chatting about 3D Human Pose,"We introduce ChatPose, a framework employing Large Language Models (LLMs) to +understand and reason about 3D human poses from images or textual descriptions. +Our work is motivated by the human ability to intuitively understand postures +from a single image or a brief description, a process that intertwines image +interpretation, world knowledge, and an understanding of body language. +Traditional human pose estimation and generation methods often operate in +isolation, lacking semantic understanding and reasoning abilities. ChatPose +addresses these limitations by embedding SMPL poses as distinct signal tokens +within a multimodal LLM, enabling the direct generation of 3D body poses from +both textual and visual inputs. 
Leveraging the powerful capabilities of +multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks +while offering user interactions. Additionally, ChatPose empowers LLMs to apply +their extensive world knowledge in reasoning about human poses, leading to two +advanced tasks: speculative pose generation and reasoning about pose +estimation. These tasks involve reasoning about humans to generate 3D poses +from subtle text queries, possibly accompanied by images. We establish +benchmarks for these tasks, moving beyond traditional 3D pose generation and +estimation methods. Our results show that ChatPose outperforms existing +multimodal LLMs and task-specific methods on these newly proposed tasks. +Furthermore, ChatPose's ability to understand and generate 3D human poses based +on complex reasoning opens new directions in human pose analysis.",cs.CV,['cs.CV'] +SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds,Minghao Chen · Junyu Xie · Iro Laina · Andrea Vedaldi, ,,https://huggingface.co/papers/2312.09246,,,,,nan +Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach,Beichen Zhang · Xiaoxing Wang · Xiaohan Qin · Junchi Yan, ,https://arxiv.org/abs/2403.11380,,2403.11380.pdf,Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach,"Supernet is a core component in many recent Neural Architecture Search (NAS) +methods. It not only helps embody the search space but also provides a +(relative) estimation of the final performance of candidate architectures. +Thus, it is critical that the top architectures ranked by a supernet should be +consistent with those ranked by true performance, which is known as the +order-preserving ability. In this work, we analyze the order-preserving ability +on the whole search space (global) and a sub-space of top architectures +(local), and empirically show that the local order-preserving for current +two-stage NAS methods still need to be improved. To rectify this, we propose a +novel concept of Supernet Shifting, a refined search strategy combining +architecture searching with supernet fine-tuning. Specifically, apart from +evaluating, the training loss is also accumulated in searching and the supernet +is updated every iteration. Since superior architectures are sampled more +frequently in evolutionary searching, the supernet is encouraged to focus on +top architectures, thus improving local order-preserving. Besides, a +pre-trained supernet is often un-reusable for one-shot methods. We show that +Supernet Shifting can fulfill transferring supernet to a new dataset. +Specifically, the last classifier layer will be unset and trained through +evolutionary searching. Comprehensive experiments show that our method has +better order-preserving ability and can find a dominating architecture. 
+Moreover, the pre-trained supernet can be easily transferred into a new dataset +with no loss of performance.",cs.CV,['cs.CV'] +Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation,Yuan Xiao · Shiqing Ma · Juan Zhai · Chunrong Fang · Jinyuan Jia · Zhenyu Chen,https://github.com/xiaoyuanpigo/maxlin,,https://software.nju.edu.cn/English/News/Selected/20240228/i260151.html,,,,,nan +Harnessing Meta-Learning for Improving Full-Frame Video Stabilization,Muhammad Kashif Ali · Eun Woo Im · Dongjin Kim · Tae Hyun Kim, ,https://arxiv.org/abs/2403.03662v1,,2403.03662v1.pdf,Harnessing Meta-Learning for Improving Full-Frame Video Stabilization,"Video stabilization is a longstanding computer vision problem, particularly +pixel-level synthesis solutions for video stabilization which synthesize full +frames add to the complexity of this task. These techniques aim to stabilize +videos by synthesizing full frames while enhancing the stability of the +considered video. This intensifies the complexity of the task due to the +distinct mix of unique motion profiles and visual content present in each video +sequence, making robust generalization with fixed parameters difficult. In our +study, we introduce a novel approach to enhance the performance of pixel-level +synthesis solutions for video stabilization by adapting these models to +individual input video sequences. The proposed adaptation exploits low-level +visual cues accessible during test-time to improve both the stability and +quality of resulting videos. We highlight the efficacy of our methodology of +""test-time adaptation"" through simple fine-tuning of one of these models, +followed by significant stability gain via the integration of meta-learning +techniques. Notably, significant improvement is achieved with only a single +adaptation step. The versatility of the proposed algorithm is demonstrated by +consistently improving the performance of various pixel-level synthesis models +for video stabilization in real-world scenarios.",cs.CV,['cs.CV'] +SEED-Bench: Benchmarking Multimodal Large Language Models,Bohao Li · Yuying Ge · Yixiao Ge · Guangzhi Wang · Rui Wang · Ruimao Zhang · Ying Shan, ,https://arxiv.org/abs/2307.16125,,2307.16125.pdf,SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension,"Based on powerful Large Language Models (LLMs), recent generative Multimodal +Large Language Models (MLLMs) have gained prominence as a pivotal research +area, exhibiting remarkable capability for both comprehension and generation. +In this work, we address the evaluation of generative comprehension in MLLMs as +a preliminary step towards a comprehensive assessment of generative models, by +introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple +choice questions with accurate human annotations (x 6 larger than existing +benchmarks), which spans 12 evaluation dimensions including the comprehension +of both the image and video modality. We develop an advanced pipeline for +generating multiple-choice questions that target specific evaluation +dimensions, integrating both automatic filtering and manual verification +processes. Multiple-choice questions with groundtruth options derived from +human annotation enables an objective and efficient assessment of model +performance, eliminating the need for human or GPT intervention during +evaluation. 
We further evaluate the performance of 18 models across all 12 +dimensions, covering both the spatial and temporal understanding. By revealing +the limitations of existing MLLMs through evaluation results, we aim for +SEED-Bench to provide insights for motivating future research. We will launch +and consistently maintain a leaderboard to provide a platform for the community +to assess and investigate model capability.",cs.CL,"['cs.CL', 'cs.CV']" +Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views,Ziwei Zhao · Yuchen Wang · Chuhua Wang, ,,,,,,,nan +Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models,Zhang Li · Biao Yang · Qiang Liu · Zhiyin Ma · Shuo Zhang · Jingxu Yang · Yabo Sun · Yuliang Liu · Xiang Bai, ,https://arxiv.org/abs/2311.06607,,2311.06607.pdf,Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models,"Large Multimodal Models (LMMs) have shown promise in vision-language tasks +but struggle with high-resolution input and detailed scene understanding. +Addressing these challenges, we introduce Monkey to enhance LMM capabilities. +Firstly, Monkey processes input images by dividing them into uniform patches, +each matching the size (e.g., 448x448) used in the original training of the +well-trained vision encoder. Equipped with individual adapter for each patch, +Monkey can handle higher resolutions up to 1344x896 pixels, enabling the +detailed capture of complex visual information. Secondly, it employs a +multi-level description generation method, enriching the context for +scene-object associations. This two-part strategy ensures more effective +learning from generated data: the higher resolution allows for a more detailed +capture of visuals, which in turn enhances the effectiveness of comprehensive +descriptions. Extensive ablative results validate the effectiveness of our +designs. Additionally, experiments on 18 datasets further demonstrate that +Monkey surpasses existing LMMs in many tasks like Image Captioning and various +Visual Question Answering formats. Specially, in qualitative tests focused on +dense text question answering, Monkey has exhibited encouraging results +compared with GPT4V. Code is available at +https://github.com/Yuliang-Liu/Monkey.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing,Zhen Guo · Hongping Gan, ,,https://www.mdpi.com/1099-4300/25/12/1579,,,,,nan +Revisiting Counterfactual Problems in Referring Expression Comprehension,Zhihan Yu · Ruifan Li, ,,https://link.springer.com/chapter/10.1007/978-3-031-41682-8_25,,,,,nan +AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring,Xintian Mao · Xiwen Gao · Yan Wang,https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur,https://arxiv.org/abs/2402.06117,,2402.06117.pdf,Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for Motion Deblurring,"This paper tackles the problem of motion deblurring of dynamic scenes. +Although end-to-end fully convolutional designs have recently advanced the +state-of-the-art in non-uniform motion deblurring, their performance-complexity +trade-off is still sub-optimal. Most existing approaches achieve a large +receptive field by increasing the number of generic convolution layers and +kernel size. 
In this work, we propose a pixel adaptive and feature attentive +design for handling large blur variations across different spatial locations +and process each test image adaptively. We design a content-aware global-local +filtering module that significantly improves performance by considering not +only global dependencies but also by dynamically exploiting neighboring pixel +information. We further introduce a pixel-adaptive non-uniform sampling +strategy that implicitly discovers the difficult-to-restore regions present in +the image and, in turn, performs fine-grained refinement in a progressive +manner. Extensive qualitative and quantitative comparisons with prior art on +deblurring benchmarks demonstrate that our approach performs favorably against +the state-of-the-art deblurring algorithms.",cs.CV,['cs.CV'] +E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator,Wenjun Wu · Lingling Zhang · Jun Liu · Xi Tang · Yaxian Wang · Shaowei Wang · QianYing Wang, ,https://arxiv.org/abs/2401.16287,,2401.16287.pdf,GAPS: Geometry-Aware Problem Solver,"Geometry problem solving presents a formidable challenge within the NLP +community. Existing approaches often rely on models designed for solving math +word problems, neglecting the unique characteristics of geometry math problems. +Additionally, the current research predominantly focuses on geometry +calculation problems, while overlooking other essential aspects like proving. +In this study, we address these limitations by proposing the Geometry-Aware +Problem Solver (GAPS) model. GAPS is specifically designed to generate solution +programs for geometry math problems of various types with the help of its +unique problem-type classifier. To achieve this, GAPS treats the solution +program as a composition of operators and operands, segregating their +generation processes. Furthermore, we introduce the geometry elements +enhancement method, which enhances the ability of GAPS to recognize geometry +elements accurately. By leveraging these improvements, GAPS showcases +remarkable performance in resolving geometry math problems. Our experiments +conducted on the UniGeo dataset demonstrate the superiority of GAPS over the +state-of-the-art model, Geoformer. Specifically, GAPS achieves an accuracy +improvement of more than 5.3% for calculation tasks and an impressive 41.1% for +proving tasks. Notably, GAPS achieves an impressive accuracy of 97.5% on +proving problems, representing a significant advancement in solving geometry +proving tasks.",cs.AI,"['cs.AI', 'cs.CL']" +IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection,Junbo Yin · Wenguan Wang · Runnan Chen · Wei Li · Ruigang Yang · Pascal Frossard · Jianbing Shen,https://github.com/yinjunbo/IS-Fusion,https://arxiv.org/abs/2403.15241,,2403.15241.pdf,IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection,"Bird's eye view (BEV) representation has emerged as a dominant solution for +describing 3D space in autonomous driving scenarios. However, objects in the +BEV representation typically exhibit small sizes, and the associated point +cloud context is inherently sparse, which leads to great challenges for +reliable 3D perception. In this paper, we propose IS-Fusion, an innovative +multimodal fusion framework that jointly captures the Instance- and Scene-level +contextual information. 
IS-Fusion essentially differs from existing approaches +that only focus on the BEV scene-level fusion by explicitly incorporating +instance-level multimodal information, thus facilitating the instance-centric +tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) +module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid +and Grid-to-Region transformers to capture the multimodal scene context at +different granularities. IGF mines instance candidates, explores their +relationships, and aggregates the local multimodal context for each instance. +These instances then serve as guidance to enhance the scene feature and yield +an instance-aware BEV representation. On the challenging nuScenes benchmark, +IS-Fusion outperforms all the published multimodal works to date. Code is +available at: https://github.com/yinjunbo/IS-Fusion.",cs.CV,['cs.CV'] +Open-Vocabulary Semantic Segmentation with Image Embedding Balancing,Xiangheng Shan · Dongyue Wu · Guilin Zhu · Yuanjie Shao · Nong Sang · Changxin Gao, ,https://arxiv.org/abs/2312.04089,,2312.04089.pdf,Open-Vocabulary Segmentation with Semantic-Assisted Calibration,"This paper studies open-vocabulary segmentation (OVS) through calibrating +in-vocabulary and domain-biased embedding space with generalized contextual +prior of CLIP. As the core of open-vocabulary understanding, alignment of +visual content with the semantics of unbounded text has become the bottleneck +of this field. To address this challenge, recent works propose to utilize CLIP +as an additional classifier and aggregate model predictions with CLIP +classification results. Despite their remarkable progress, performance of OVS +methods in relevant scenarios is still unsatisfactory compared with supervised +counterparts. We attribute this to the in-vocabulary embedding and +domain-biased CLIP prediction. To this end, we present a Semantic-assisted +CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior +of CLIP into proposal embedding to avoid collapsing on known categories. +Besides, a contextual shift strategy is applied to mitigate the lack of global +context and unnatural background noise. With above designs, SCAN achieves +state-of-the-art performance on all popular open-vocabulary segmentation +benchmarks. Furthermore, we also focus on the problem of existing evaluation +system that ignores semantic duplication across categories, and propose a new +metric called Semantic-Guided IoU (SG-IoU).",cs.CV,['cs.CV'] +Doubly Abductive Counterfactual Inference for Text-based Image Editing,Xue Song · Jiequan Cui · Hanwang Zhang · Jingjing Chen · Richang Hong · Yu-Gang Jiang,https://github.com/xuesong39/DAC,https://arxiv.org/abs/2403.02981,,2403.02981.pdf,Doubly Abductive Counterfactual Inference for Text-based Image Editing,"We study text-based image editing (TBIE) of a single image by counterfactual +inference because it is an elegant formulation to precisely address the +requirement: the edited image should retain the fidelity of the original one. +Through the lens of the formulation, we find that the crux of TBIE is that +existing techniques hardly achieve a good trade-off between editability and +fidelity, mainly due to the overfitting of the single-image fine-tuning. To +this end, we propose a Doubly Abductive Counterfactual inference framework +(DAC). We first parameterize an exogenous variable as a UNet LoRA, whose +abduction can encode all the image details. 
Second, we abduct another exogenous +variable parameterized by a text encoder LoRA, which recovers the lost +editability caused by the overfitted first abduction. Thanks to the second +abduction, which exclusively encodes the visual transition from post-edit to +pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit +back to post-edit, thereby accomplishing the edit. Through extensive +experiments, our DAC achieves a good trade-off between editability and +fidelity. Thus, we can support a wide spectrum of user editing intents, +including addition, removal, manipulation, replacement, style transfer, and +facial change, which are extensively validated in both qualitative and +quantitative evaluations. Codes are in https://github.com/xuesong39/DAC.",cs.CV,['cs.CV'] +SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations,Pu Li · Jianwei Guo · HUIBIN LI · Bedrich Benes · Dong-Ming Yan, ,https://ar5iv.labs.arxiv.org/html/2303.10613,,2303.10613.pdf,SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations,"Reverse engineering CAD models from raw geometry is a classic but strenuous +research problem. Previous learning-based methods rely heavily on labels due to +the supervised design patterns or reconstruct CAD shapes that are not easily +editable. In this work, we introduce SECAD-Net, an end-to-end neural network +aimed at reconstructing compact and easy-to-edit CAD models in a +self-supervised manner. Drawing inspiration from the modeling language that is +most commonly used in modern CAD software, we propose to learn 2D sketches and +3D extrusion parameters from raw shapes, from which a set of extrusion +cylinders can be generated by extruding each sketch from a 2D plane into a 3D +body. By incorporating the Boolean operation (i.e., union), these cylinders can +be combined to closely approximate the target geometry. We advocate the use of +implicit fields for sketch representation, which allows for creating CAD +variations by interpolating latent codes in the sketch latent space. Extensive +experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness +of our method, and show superiority over state-of-the-art alternatives +including the closely related method for supervised CAD reconstruction. We +further apply our approach to CAD editing and single-view CAD reconstruction. +The code is released at https://github.com/BunnySoCrazy/SECAD-Net.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Grounding and Enhancing Grid-based Models for Neural Fields,Zelin Zhao · FENGLEI FAN · Wenlong Liao · Junchi Yan,https://sites.google.com/view/cvpr24-2034-submission/home,https://arxiv.org/abs/2403.20002,,2403.20002.pdf,Grounding and Enhancing Grid-based Models for Neural Fields,"Many contemporary studies utilize grid-based models for neural field +representation, but a systematic analysis of grid-based models is still +missing, hindering the improvement of those models. Therefore, this paper +introduces a theoretical framework for grid-based models. This framework points +out that these models' approximation and generalization behaviors are +determined by grid tangent kernels (GTK), which are intrinsic properties of +grid-based models. The proposed framework facilitates a consistent and +systematic analysis of diverse grid-based models. Furthermore, the introduced +framework motivates the development of a novel grid-based model named the +Multiplicative Fourier Adaptive Grid (MulFAGrid). 
The numerical analysis +demonstrates that MulFAGrid exhibits a lower generalization bound than its +predecessors, indicating its robust generalization performance. Empirical +studies reveal that MulFAGrid achieves state-of-the-art performance in various +tasks, including 2D image fitting, 3D signed distance field (SDF) +reconstruction, and novel view synthesis, demonstrating superior representation +ability. The project website is available at +https://sites.google.com/view/cvpr24-2034-submission/home.",cs.CV,['cs.CV'] +Language Model Guided Interpretable Video Action Reasoning,Ning Wang · Guangming Zhu · Hongsheng Li · Liang Zhang · Syed Afaq Ali Shah · Mohammed Bennamoun, ,https://arxiv.org/abs/2404.01591,,2404.01591.pdf,Language Model Guided Interpretable Video Action Reasoning,"While neural networks have excelled in video action recognition tasks, their +black-box nature often obscures the understanding of their decision-making +processes. Recent approaches used inherently interpretable models to analyze +video actions in a manner akin to human reasoning. These models, however, +usually fall short in performance compared to their black-box counterparts. In +this work, we present a new framework named Language-guided Interpretable +Action Recognition framework (LaIAR). LaIAR leverages knowledge from language +models to enhance both the recognition capabilities and the interpretability of +video models. In essence, we redefine the problem of understanding video model +decisions as a task of aligning video and language models. Using the logical +reasoning captured by the language model, we steer the training of the video +model. This integrated approach not only improves the video model's +adaptability to different domains but also boosts its overall performance. +Extensive experiments on two complex video action datasets, Charades & CAD-120, +validates the improved performance and interpretability of our LaIAR framework. +The code of LaIAR is available at https://github.com/NingWang2049/LaIAR.",cs.CV,['cs.CV'] +4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling,Sherwin Bahmani · Ivan Skorokhodov · Victor Rong · Gordon Wetzstein · Leonidas Guibas · Peter Wonka · Sergey Tulyakov · Jeong Joon Park · Andrea Tagliasacchi · David B. Lindell,https://sherwinbahmani.github.io/4dfy,https://arxiv.org/abs/2311.17984,,2311.17984.pdf,4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling,"Recent breakthroughs in text-to-4D generation rely on pre-trained +text-to-image and text-to-video models to generate dynamic 3D scenes. However, +current text-to-4D methods face a three-way tradeoff between the quality of +scene appearance, 3D structure, and motion. For example, text-to-image models +and their 3D-aware variants are trained on internet-scale image datasets and +can be used to produce scenes with realistic appearance and 3D structure -- but +no motion. Text-to-video models are trained on relatively smaller video +datasets and can produce scenes with motion, but poorer appearance and 3D +structure. While these models have complementary strengths, they also have +opposing weaknesses, making it difficult to combine them in a way that +alleviates this three-way tradeoff. Here, we introduce hybrid score +distillation sampling, an alternating optimization procedure that blends +supervision signals from multiple pre-trained diffusion models and incorporates +benefits of each for high-fidelity text-to-4D generation. 
Using hybrid SDS, we +demonstrate synthesis of 4D scenes with compelling appearance, 3D structure, +and motion.",cs.CV,['cs.CV'] +Single-Model and Any-Modality for Video Object Tracking,Zongwei Wu · Jilai Zheng · Xiangxuan Ren · Florin-Alexandru Vasluianu · Chao Ma · Danda Paudel · Luc Van Gool · Radu Timofte,https://github.com/Zongwei97/UnTrack,https://arxiv.org/abs/2311.15851,,2311.15851.pdf,Single-Model and Any-Modality for Video Object Tracking,"In the realm of video object tracking, auxiliary modalities such as depth, +thermal, or event data have emerged as valuable assets to complement the RGB +trackers. In practice, most existing RGB trackers learn a single set of +parameters to use them across datasets and applications. However, a similar +single-model unification for multi-modality tracking presents several +challenges. These challenges stem from the inherent heterogeneity of inputs -- +each with modality-specific representations, the scarcity of multi-modal +datasets, and the absence of all the modalities at all times. In this work, we +introduce Un-Track, a Unified Tracker of a single set of parameters for any +modality. To handle any modality, our method learns their common latent space +through low-rank factorization and reconstruction techniques. More importantly, +we use only the RGB-X pairs to learn the common latent space. This unique +shared representation seamlessly binds all modalities together, enabling +effective unification and accommodating any missing modality, all within a +single transformer-based architecture. Our Un-Track achieves +8.1 absolute +F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) +GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient +prompting strategy. Extensive comparisons on five benchmark datasets with +different modalities show that Un-Track surpasses both SOTA unified trackers +and modality-specific counterparts, validating our effectiveness and +practicality. The source code is publicly available at +https://github.com/Zongwei97/UnTrack.",cs.CV,['cs.CV'] +Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion,Xunpeng Yi · Han Xu · HAO ZHANG · Linfeng Tang · Jiayi Ma, ,https://arxiv.org/abs/2403.16387,,2403.16387.pdf,Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion,"Image fusion aims to combine information from different source images to +create a comprehensively representative image. Existing fusion methods are +typically helpless in dealing with degradations in low-quality source images +and non-interactive to multiple subjective and objective needs. To solve them, +we introduce a novel approach that leverages semantic text guidance image +fusion model for degradation-aware and interactive image fusion task, termed as +Text-IF. It innovatively extends the classical image fusion to the text guided +image fusion along with the ability to harmoniously address the degradation and +interaction issues during fusion. Through the text semantic encoder and +semantic interaction fusion decoder, Text-IF is accessible to the all-in-one +infrared and visible image degradation-aware processing and the interactive +flexible fusion outcomes. In this way, Text-IF achieves not only multi-modal +image fusion, but also multi-modal information fusion. 
Extensive experiments +prove that our proposed text guided image fusion strategy has obvious +advantages over SOTA methods in the image fusion performance and degradation +treatment. The code is available at https://github.com/XunpengYi/Text-IF.",cs.CV,['cs.CV'] +TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,Sai Kumar Dwivedi · Yu Sun · Priyanka Patel · Yao Feng · Michael J. Black,https://tokenhmr.is.tue.mpg.de/,https://arxiv.org/abs/2404.16752,,2404.16752.pdf,TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,"We address the problem of regressing 3D human pose and shape from a single +image, with a focus on 3D accuracy. The current best methods leverage large +datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust +performance. With such methods, we observe a paradoxical decline in 3D pose +accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and +the use of an approximate camera projection model. We quantify the error +induced by current camera models and show that fitting 2D keypoints and p-GT +accurately causes incorrect 3D poses. Our analysis defines the invalid +distances within which minimizing 2D and p-GT losses is detrimental. We use +this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that +penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, +there are many 3D poses that could equally explain the 2D evidence. To reduce +this ambiguity we need a prior over valid human poses but such priors can +introduce unwanted bias. To address this, we exploit a tokenized representation +of human pose and reformulate the problem as token prediction. This restricts +the estimated poses to the space of valid poses, effectively providing a +uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that +our reformulated keypoint loss and tokenization allows us to train on +in-the-wild data while improving 3D accuracy over the state-of-the-art. Our +models and code are available for research at https://tokenhmr.is.tue.mpg.de.",cs.CV,['cs.CV'] +Unifying Top-down and Bottom-up Scanpath Prediction using Transformers,Zhibo Yang · Sounak Mondal · Seoyoung Ahn · Ruoyu Xue · Gregory Zelinsky · Minh Hoai · Dimitris Samaras,https://github.com/cvlab-stonybrook/HAT,https://arxiv.org/html/2303.09383v3,,2303.09383v3.pdf,Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers,"Most models of visual attention aim at predicting either top-down or +bottom-up control, as studied using different visual search and free-viewing +tasks. In this paper we propose the Human Attention Transformer (HAT), a single +model that predicts both forms of attention control. HAT uses a novel +transformer-based architecture and a simplified foveated retina that +collectively create a spatio-temporal awareness akin to the dynamic visual +working memory of humans. HAT not only establishes a new state-of-the-art in +predicting the scanpath of fixations made during target-present and +target-absent visual search and ``taskless'' free viewing, but also makes human +gaze behavior interpretable. Unlike previous methods that rely on a coarse grid +of fixation cells and experience information loss due to fixation +discretization, HAT features a sequential dense prediction architecture and +outputs a dense heatmap for each fixation, thus avoiding discretizing +fixations. 
HAT sets a new standard in computational attention, which emphasizes +effectiveness, generality, and interpretability. HAT's demonstrated scope and +applicability will likely inspire the development of new attention models that +can better predict human behavior in various attention-demanding scenarios. +Code is available at https://github.com/cvlab-stonybrook/HAT.",cs.CV,"['cs.CV', 'cs.AI']" +Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning,Chen Zhao · Shuming Liu · Karttikeya Mangalam · Guocheng Qian · Fatimah Zohra · Abdulmohsen Alghannam · Jitendra Malik · Bernard Ghanem, ,https://arxiv.org/abs/2401.04105,,2401.04105.pdf,Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning,"Large pretrained models are increasingly crucial in modern computer vision +tasks. These models are typically used in downstream tasks by end-to-end +finetuning, which is highly memory-intensive for tasks with high-resolution +data, e.g., video understanding, small object detection, and point cloud +analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, +or Dr$^2$Net, a novel family of network architectures that acts as a surrogate +network to finetune a pretrained model with substantially reduced memory +consumption. Dr$^2$Net contains two types of residual connections, one +maintaining the residual structure in the pretrained models, and the other +making the network reversible. Due to its reversibility, intermediate +activations, which can be reconstructed from output, are cleared from memory +during training. We use two coefficients on either type of residual connections +respectively, and introduce a dynamic training strategy that seamlessly +transitions the pretrained model to a reversible network with much higher +numerical precision. We evaluate Dr$^2$Net on various pretrained models and +various tasks, and show that it can reach comparable performance to +conventional finetuning but with significantly less memory usage.",cs.CV,"['cs.CV', 'cs.AI']" +SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction,Pin Tang · Zhongdao Wang · Guoqing Wang · Jilai Zheng · Xiangxuan Ren · Bailan Feng · Chao Ma, ,https://arxiv.org/abs/2404.09502,,2404.09502.pdf,SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction,"Vision-based perception for autonomous driving requires an explicit modeling +of a 3D space, where 2D latent representations are mapped and subsequent 3D +operators are applied. However, operating on dense latent spaces introduces a +cubic time and space complexity, which limits scalability in terms of +perception range or spatial resolution. Existing approaches compress the dense +representation using projections like Bird's Eye View (BEV) or Tri-Perspective +View (TPV). Although efficient, these projections result in information loss, +especially for tasks like semantic occupancy prediction. To address this, we +propose SparseOcc, an efficient occupancy network inspired by sparse point +cloud processing. It utilizes a lossless sparse latent representation with +three key innovations. Firstly, a 3D sparse diffuser performs latent completion +using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature +pyramid and sparse interpolation enhance scales with information from others. +Finally, the transformer head is redesigned as a sparse variant. 
SparseOcc +achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. +Interestingly, it also improves accuracy, from 12.8% to 14.1% mIOU, which in +part can be attributed to the sparse representation's ability to avoid +hallucinations on empty voxels.",cs.CV,['cs.CV'] +4D Gaussian Splatting for Real-Time Dynamic Scene Rendering,Guanjun Wu · Taoran Yi · Jiemin Fang · Lingxi Xie · Xiaopeng Zhang · Wei Wei · Wenyu Liu · Qi Tian · Xinggang Wang,guanjunwu.github.io/4dgs,https://arxiv.org/abs/2310.08528,,2310.08528.pdf,4D Gaussian Splatting for Real-Time Dynamic Scene Rendering,"Representing and rendering dynamic scenes has been an important but +challenging task. Especially, to accurately model complex motions, high +efficiency is usually hard to guarantee. To achieve real-time dynamic scene +rendering while also enjoying high training and storage efficiency, we propose +4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes +rather than applying 3D-GS for each individual frame. In 4D-GS, a novel +explicit representation containing both 3D Gaussians and 4D neural voxels is +proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is +proposed to efficiently build Gaussian features from 4D neural voxels and then +a lightweight MLP is applied to predict Gaussian deformations at novel +timestamps. Our 4D-GS method achieves real-time rendering under high +resolutions, 82 FPS at an 800$\times$800 resolution on an RTX 3090 GPU while +maintaining comparable or better quality than previous state-of-the-art +methods. More demos and code are available at +https://guanjunwu.github.io/4dgs/.",cs.CV,"['cs.CV', 'cs.GR']" +Open-Set Domain Adaptation for Semantic Segmentation,Seun-An Choe · Ah-Hyung Shin · Keon Hee Park · Jinwoo Choi · Gyeong-Moon Park, ,https://arxiv.org/abs/2405.19899,,2405.19899.pdf,Open-Set Domain Adaptation for Semantic Segmentation,"Unsupervised domain adaptation (UDA) for semantic segmentation aims to +transfer the pixel-wise knowledge from the labeled source domain to the +unlabeled target domain. However, current UDA methods typically assume a shared +label space between source and target, limiting their applicability in +real-world scenarios where novel categories may emerge in the target domain. In +this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation +(OSDA-SS) for the first time, where the target domain includes unknown classes. +We identify two major problems in the OSDA-SS scenario as follows: 1) the +existing UDA methods struggle to predict the exact boundary of the unknown +classes, and 2) they fail to accurately predict the shape of the unknown +classes. To address these issues, we propose Boundary and Unknown Shape-Aware +open-set domain adaptation, coined BUS. Our BUS can accurately discern the +boundaries between known and unknown classes in a contrastive manner using a +novel dilation-erosion-based contrastive loss. In addition, we propose +OpenReMix, a new domain mixing augmentation method that guides our model to +effectively learn domain and size-invariant features for improving the shape +detection of the known and unknown classes. Through extensive experiments, we +demonstrate that our proposed BUS effectively detects unknown classes in the +challenging OSDA-SS scenario compared to the previous methods by a large +margin. 
The code is available at https://github.com/KHU-AGI/BUS.",cs.CV,"['cs.CV', 'cs.AI']" +Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening,Yule Duan · Xiao Wu · Haoyu Deng · Liang-Jian Deng,https://github.com/Duanyll/CANConv,https://arxiv.org/abs/2404.07543,,2404.07543.pdf,Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening,"Currently, machine learning-based methods for remote sensing pansharpening +have progressed rapidly. However, existing pansharpening methods often do not +fully exploit differentiating regional information in non-local spaces, thereby +limiting the effectiveness of the methods and resulting in redundant learning +parameters. In this paper, we introduce a so-called content-adaptive non-local +convolution (CANConv), a novel method tailored for remote sensing image +pansharpening. Specifically, CANConv employs adaptive convolution, ensuring +spatial adaptability, and incorporates non-local self-similarity through the +similarity relationship partition (SRP) and the partition-wise adaptive +convolution (PWAC) sub-modules. Furthermore, we also propose a corresponding +network architecture, called CANNet, which mainly utilizes the multi-scale +self-similarity. Extensive experiments demonstrate the superior performance of +CANConv, compared with recent promising fusion methods. Besides, we +substantiate the method's effectiveness through visualization, ablation +experiments, and comparison with existing methods on multiple test sets. The +source code is publicly available at https://github.com/duanyll/CANConv.",cs.CV,"['cs.CV', 'eess.IV']" +GSVA: Generalized Segmentation via Multimodal Large Language Models,Zhuofan Xia · Dongchen Han · Yizeng Han · Xuran Pan · Shiji Song · Gao Huang,https://github.com/LeapLabTHU/GSVA,https://arxiv.org/abs/2312.10103,,2312.10103.pdf,GSVA: Generalized Segmentation via Multimodal Large Language Models,"Generalized Referring Expression Segmentation (GRES) extends the scope of +classic RES to refer to multiple objects in one expression or identify the +empty targets absent in the image. GRES poses challenges in modeling the +complex spatial relationships of the instances in the image and identifying +non-existing referents. Multimodal Large Language Models (MLLMs) have recently +shown tremendous progress in these complicated vision-language tasks. +Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient +in understanding contexts with visual inputs. Among them, LISA, as a +representative, adopts a special [SEG] token to prompt a segmentation mask +decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing +solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot +correctly handle the cases where users might reference multiple subjects in a +singular prompt or provide descriptions incongruent with any image target. In +this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to +address this gap. Specifically, GSVA reuses the [SEG] token to prompt the +segmentation model towards supporting multiple mask references simultaneously +and innovatively learns to generate a [REJ] token to reject the null targets +explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, +marking a notable enhancement and setting a new record on the GRES benchmark +gRefCOCO dataset. 
GSVA also proves effective across various classic referring +segmentation and comprehension tasks.",cs.CV,['cs.CV'] +S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data,Xuyang Li · Danfeng Hong · Jocelyn Chanussot, ,https://arxiv.org/abs/2311.07113,,2311.07113.pdf,SpectralGPT: Spectral Remote Sensing Foundation Model,"The foundation model has recently garnered significant attention due to its +potential to revolutionize the field of visual representation learning in a +self-supervised manner. While most foundation models are tailored to +effectively process RGB images for various visual tasks, there is a noticeable +gap in research focused on spectral data, which offers valuable information for +scene understanding, especially in remote sensing (RS) applications. To fill +this gap, we created for the first time a universal RS foundation model, named +SpectralGPT, which is purpose-built to handle spectral RS images using a novel +3D generative pretrained transformer (GPT). Compared to existing foundation +models, SpectralGPT 1) accommodates input images with varying sizes, +resolutions, time series, and regions in a progressive training fashion, +enabling full utilization of extensive RS big data; 2) leverages 3D token +generation for spatial-spectral coupling; 3) captures spectrally sequential +patterns via multi-target reconstruction; 4) trains on one million spectral RS +images, yielding models with over 600 million parameters. Our evaluation +highlights significant performance improvements with pretrained SpectralGPT +models, signifying substantial potential in advancing spectral RS big data +applications within the field of geoscience across four downstream tasks: +single/multi-label scene classification, semantic segmentation, and change +detection.",cs.CV,['cs.CV'] +PointOBB: Learning Oriented Object Detection via Single Point Supervision,Junwei Luo · Xue Yang · Yi Yu · Qingyun Li · Junchi Yan · Yansheng Li, ,https://arxiv.org/abs/2311.14757,,2311.14757.pdf,PointOBB: Learning Oriented Object Detection via Single Point Supervision,"Single point-supervised object detection is gaining attention due to its +cost-effectiveness. However, existing approaches focus on generating horizontal +bounding boxes (HBBs) while ignoring oriented bounding boxes (OBBs) commonly +used for objects in aerial images. This paper proposes PointOBB, the first +single Point-based OBB generation method, for oriented object detection. +PointOBB operates through the collaborative utilization of three distinctive +views: an original view, a resized view, and a rotated/flipped (rot/flp) view. +Upon the original view, we leverage the resized and rot/flp views to build a +scale augmentation module and an angle acquisition module, respectively. In the +former module, a Scale-Sensitive Consistency (SSC) loss is designed to enhance +the deep network's ability to perceive the object scale. For accurate object +angle predictions, the latter module incorporates self-supervised learning to +predict angles, which is associated with a scale-guided Dense-to-Sparse (DS) +matching strategy for aggregating dense angles corresponding to sparse objects. +The resized and rot/flp views are switched using a progressive multi-view +switching strategy during training to achieve coupled optimization of scale and +angle. 
Experimental results on the DIOR-R and DOTA-v1.0 datasets demonstrate +that PointOBB achieves promising performance, and significantly outperforms +potential point-supervised baselines.",cs.CV,"['cs.CV', 'cs.AI']" +Long-Tail Class Incremental Learning via Independent Sub-prototype Construction,Xi Wang · Xu Yang · Jie Yin · Kun Wei · Cheng Deng, ,https://ar5iv.labs.arxiv.org/html/2210.00266,,2210.00266.pdf,Long-Tailed Class Incremental Learning,"In class incremental learning (CIL) a model must learn new classes in a +sequential manner without forgetting old ones. However, conventional CIL +methods consider a balanced distribution for each new task, which ignores the +prevalence of long-tailed distributions in the real world. In this work we +propose two long-tailed CIL scenarios, which we term ordered and shuffled +LT-CIL. Ordered LT-CIL considers the scenario where we learn from head classes +collected with more samples than tail classes which have few. Shuffled LT-CIL, +on the other hand, assumes a completely random long-tailed distribution for +each task. We systematically evaluate existing methods in both LT-CIL scenarios +and demonstrate very different behaviors compared to conventional CIL +scenarios. Additionally, we propose a two-stage learning baseline with a +learnable weight scaling layer for reducing the bias caused by long-tailed +distribution in LT-CIL and which in turn also improves the performance of +conventional CIL due to the limited exemplars. Our results demonstrate the +superior performance (up to 6.44 points in average incremental accuracy) of our +approach on CIFAR-100 and ImageNet-Subset. The code is available at +https://github.com/xialeiliu/Long-Tailed-CIL",cs.CV,['cs.CV'] +FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer,Dongyeong Hwang · Hyunju Kim · Sunwoo Kim · Kijung Shin, ,https://arxiv.org/abs/2403.12821,,2403.12821.pdf,FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer,"The success of a specific neural network architecture is closely tied to the +dataset and task it tackles; there is no one-size-fits-all solution. Thus, +considerable efforts have been made to quickly and accurately estimate the +performances of neural architectures, without full training or evaluation, for +given tasks and datasets. Neural architecture encoding has played a crucial +role in the estimation, and graphbased methods, which treat an architecture as +a graph, have shown prominent performance. For enhanced representation learning +of neural architectures, we introduce FlowerFormer, a powerful graph +transformer that incorporates the information flows within a neural +architecture. FlowerFormer consists of two key components: (a) bidirectional +asynchronous message passing, inspired by the flows; (b) global attention built +on flow-based masking. Our extensive experiments demonstrate the superiority of +FlowerFormer over existing neural encoding methods, and its effectiveness +extends beyond computer vision models to include graph neural networks and auto +speech recognition models. 
Our code is available at +http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.",cs.LG,"['cs.LG', 'cs.AI']" +Convolutional Prompting meets Language Models for Continual Learning,Anurag Roy · Riddhiman Moulick · Vinay Verma · Saptarshi Ghosh · Abir Das,https://cvir.github.io/projects/convprompt.html,https://arxiv.org/abs/2403.20317,,2403.20317.pdf,Convolutional Prompting meets Language Models for Continual Learning,"Continual Learning (CL) enables machine learning models to learn from +continuously shifting new training data in absence of data from old tasks. +Recently, pretrained vision transformers combined with prompt tuning have shown +promise for overcoming catastrophic forgetting in CL. These approaches rely on +a pool of learnable prompts which can be inefficient in sharing knowledge +across tasks leading to inferior performance. In addition, the lack of +fine-grained layer specific prompts does not allow these to fully express the +strength of the prompts for CL. We address these limitations by proposing +ConvPrompt, a novel convolutional prompt creation mechanism that maintains +layer-wise shared embeddings, enabling both layer-specific learning and better +concept transfer across tasks. The intelligent use of convolution enables us to +maintain a low parameter overhead without compromising performance. We further +leverage Large Language Models to generate fine-grained text descriptions of +each category which are used to get task similarity and dynamically decide the +number of prompts to be learned. Extensive experiments demonstrate the +superiority of ConvPrompt and improves SOTA by ~3% with significantly less +parameter overhead. We also perform strong ablation over various modules to +disentangle the importance of different components.",cs.CV,['cs.CV'] +As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors,Seungwoo Yoo · Kunho Kim · Vladimir G. Kim · Minhyuk Sung, ,https://arxiv.org/abs/2311.16739,,2311.16739.pdf,As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors,"We present As-Plausible-as-Possible (APAP) mesh deformation technique that +leverages 2D diffusion priors to preserve the plausibility of a mesh under +user-controlled deformation. Our framework uses per-face Jacobians to represent +mesh deformations, where mesh vertex coordinates are computed via a +differentiable Poisson Solve. The deformed mesh is rendered, and the resulting +2D image is used in the Score Distillation Sampling (SDS) process, which +enables extracting meaningful plausibility priors from a pretrained 2D +diffusion model. To better preserve the identity of the edited mesh, we +fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a +user-prescribed handle displacement are then backpropagated to the per-face +Jacobians, and we use iterative gradient descent to compute the final +deformation that balances between the user edit and the output plausibility. We +evaluate our method with 2D and 3D meshes and demonstrate qualitative and +quantitative improvements when using plausibility priors over +geometry-preservation or distortion-minimization priors used by previous +techniques. 
Our project page is at: https://as-plausible-aspossible.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +MR-VNet: Media Restoration using Volterra Networks,Siddharth Roheda · Amit Unde · Loay Rashid, ,,https://ieeexplore.ieee.org/document/10251925,,,,,nan +Low-Latency Neural Stereo Streaming,Qiqi Hou · Farzad Farhadzadeh · Amir Said · Guillaume Sautiere · Hoang Le, ,https://arxiv.org/html/2403.17879v1,,2403.17879v1.pdf,Low-Latency Neural Stereo Streaming,"The rise of new video modalities like virtual reality or autonomous driving +has increased the demand for efficient multi-view video compression methods, +both in terms of rate-distortion (R-D) performance and in terms of delay and +runtime. While most recent stereo video compression approaches have shown +promising performance, they compress left and right views sequentially, leading +to poor parallelization and runtime performance. This work presents Low-Latency +neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video +coding method designed for fast and efficient low-latency stereo video +streaming. Instead of using a sequential cross-view motion compensation like +existing methods, LLSS introduces a bidirectional feature shifting module to +directly exploit mutual information among views and encode them effectively +with a joint cross-view prior model for entropy coding. Thanks to this design, +LLSS processes left and right views in parallel, minimizing latency; all while +substantially improving R-D performance compared to both existing neural and +conventional codecs.",cs.CV,"['cs.CV', 'eess.IV']" +SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction,Zhiyang Yao · Shuyang Liu · Xiaoyun Yuan · Lu Fang, ,,https://ieeexplore.ieee.org/document/10463068/,,,,,nan +MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation,Sumanth Udupa · Prajwal Gurunath · Aniruddh Sikdar · Suresh Sundaram,https://arxiv.org/abs/2311.18331,https://arxiv.org/abs/2311.18331v1,,2311.18331v1.pdf,MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation,"Deep neural networks have shown exemplary performance on semantic scene +understanding tasks on source domains, but due to the absence of style +diversity during training, enhancing performance on unseen target domains using +only single source domain data remains a challenging task. Generation of +simulated data is a feasible alternative to retrieving large style-diverse +real-world datasets as it is a cumbersome and budget-intensive process. +However, the large domain-specific inconsistencies between simulated and +real-world data pose a significant generalization challenge in semantic +segmentation. In this work, to alleviate this problem, we propose a novel +MultiResolution Feature Perturbation (MRFP) technique to randomize +domain-specific fine-grained features and perturb style of coarse features. Our +experimental results on various urban-scene segmentation datasets clearly +indicate that, along with the perturbation of style-information, perturbation +of fine-feature components is paramount to learn domain invariant robust +feature maps for semantic segmentation models. 
MRFP is a simple and +computationally efficient, transferable module with no additional learnable +parameters or objective functions, that helps state-of-the-art deep neural +networks to learn robust domain invariant features for simulation-to-real +semantic segmentation.",cs.CV,"['cs.CV', 'cs.AI']" +Theoretically Achieving Continuous Representation of Oriented Bounding Boxes,Zikai Xiao · Guo-Ye Yang · Xue Yang · Tai-Jiang Mu · Junchi Yan · Shi-Min Hu, ,https://arxiv.org/abs/2402.18975v1,,2402.18975v1.pdf,Theoretically Achieving Continuous Representation of Oriented Bounding Boxes,"Considerable efforts have been devoted to Oriented Object Detection (OOD). +However, one lasting issue regarding the discontinuity in Oriented Bounding Box +(OBB) representation remains unresolved, which is an inherent bottleneck for +extant OOD methods. This paper endeavors to completely solve this issue in a +theoretically guaranteed manner and puts an end to the ad-hoc efforts in this +direction. Prior studies typically can only address one of the two cases of +discontinuity: rotation and aspect ratio, and often inadvertently introduce +decoding discontinuity, e.g. Decoding Incompleteness (DI) and Decoding +Ambiguity (DA) as discussed in literature. Specifically, we propose a novel +representation method called Continuous OBB (COBB), which can be readily +integrated into existing detectors e.g. Faster-RCNN as a plugin. It can +theoretically ensure continuity in bounding box regression which to our best +knowledge, has not been achieved in literature for rectangle-based object +representation. For fairness and transparency of experiments, we have developed +a modularized benchmark based on the open-source deep learning framework +Jittor's detection toolbox JDet for OOD evaluation. On the popular DOTA +dataset, by integrating Faster-RCNN as the same baseline model, our new method +outperforms the peer method Gliding Vertex by 1.13% mAP50 (relative improvement +1.54%), and 2.46% mAP75 (relative improvement 5.91%), without any tricks.",cs.CV,"['cs.CV', 'cs.AI']" +MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors,He Zhang · Shenghao Ren · Haolei Yuan · Jianhui Zhao · Fan Li · Shuangpeng Sun · Zhenghao Liang · Tao Yu · Qiu Shen · Xun Cao,https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/,https://arxiv.org/abs/2403.17610,,2403.17610.pdf,MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors,"Foot contact is an important cue for human motion capture, understanding, and +generation. Existing datasets tend to annotate dense foot contact using visual +matching with thresholding or incorporating pressure signals. However, these +approaches either suffer from low accuracy or are only designed for small-range +and slow motion. There is still a lack of a vision-pressure multimodal dataset +with large-range and fast human motion, as well as accurate and dense +foot-contact annotation. To fill this gap, we propose a Multimodal MoCap +Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate +and dense plantar pressure signals synchronized with RGBD observations, which +is especially useful for both plausible shape estimation, robust pose fitting +without foot drifting, and accurate global translation tracking. To validate +the dataset, we propose an RGBD-P SMPL fitting method and also a +monocular-video-based baseline framework, VP-MoCap, for human motion capture. +Experiments demonstrate that our RGBD-P SMPL Fitting results significantly +outperform pure visual motion capture. 
Moreover, VP-MoCap outperforms SOTA +methods in foot-contact and global translation estimation accuracy. We believe +the configuration of the dataset and the baseline frameworks will stimulate the +research in this direction and also provide a good reference for MoCap +applications in various domains. Project page: +https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.",cs.CV,['cs.CV'] +Learning Correlation Structures for Vision Transformers,Manjin Kim · Paul Hongsuck Seo · Cordelia Schmid · Minsu Cho, ,https://arxiv.org/abs/2404.03924,,2404.03924.pdf,Learning Correlation Structures for Vision Transformers,"We introduce a new attention mechanism, dubbed structural self-attention +(StructSA), that leverages rich correlation patterns naturally emerging in +key-query interactions of attention. StructSA generates attention maps by +recognizing space-time structures of key-query correlations via convolution and +uses them to dynamically aggregate local contexts of value features. This +effectively leverages rich structural patterns in images and videos such as +scene layouts, object motion, and inter-object relations. Using StructSA as a +main building block, we develop the structural vision transformer (StructViT) +and evaluate its effectiveness on both image and video classification tasks, +achieving state-of-the-art results on ImageNet-1K, Kinetics-400, +Something-Something V1 & V2, Diving-48, and FineGym.",cs.CV,['cs.CV'] +Image Restoration by Denoising Diffusion Models With Iteratively Preconditioned Guidance,Tomer Garber · Tom Tirer,https://github.com/tirer-lab/DDPG,https://arxiv.org/abs/2312.16519,,2312.16519.pdf,Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance,"Training deep neural networks has become a common approach for addressing +image restoration problems. An alternative for training a ""task-specific"" +network for each observation model is to use pretrained deep denoisers for +imposing only the signal's prior within iterative algorithms, without +additional training. Recently, a sampling-based variant of this approach has +become popular with the rise of diffusion/score-based generative models. Using +denoisers for general purpose restoration requires guiding the iterations to +ensure agreement of the signal with the observations. In low-noise settings, +guidance that is based on back-projection (BP) has been shown to be a promising +strategy (used recently also under the names ""pseudoinverse"" or +""range/null-space"" guidance). However, the presence of noise in the +observations hinders the gains from this approach. In this paper, we propose a +novel guidance technique, based on preconditioning that allows traversing from +BP-based guidance to least squares based guidance along the restoration scheme. +The proposed approach is robust to noise while still having much simpler +implementation than alternative methods (e.g., it does not require SVD or a +large number of iterations). 
We use it within both an optimization scheme and a +sampling-based scheme, and demonstrate its advantages over existing methods for +image deblurring and super-resolution.",eess.IV,"['eess.IV', 'cs.CV']" +Resource-Efficient Transformer Pruning for Finetuning of Large Models,Fatih Ilhan · Gong Su · Selim Tekin · Tiansheng Huang · Sihao Hu · Ling Liu, ,https://arxiv.org/abs/2403.14608,,2403.14608.pdf,Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey,"Large models represent a groundbreaking advancement in multiple application +fields, enabling remarkable achievements across various tasks. However, their +unprecedented scale comes with significant computational costs. These models, +often consisting of billions of parameters, require vast amounts of +computational resources for execution. Especially, the expansive scale and +computational demands pose considerable challenges when customizing them for +particular downstream tasks, particularly over the hardware platforms +constrained by computational capabilities. Parameter Efficient Fine-Tuning +(PEFT) provides a practical solution by efficiently adapt the large models over +the various downstream tasks. In particular, PEFT refers to the process of +adjusting the parameters of a pre-trained large models to adapt it to a +specific task while minimizing the number of additional parameters introduced +or computational resources required. This approach is particularly important +when dealing with large language models with high parameter counts, as +fine-tuning these models from scratch can be computationally expensive and +resource-intensive, posing considerable challenges in the supporting system +platform design. In this survey, we present comprehensive studies of various +PEFT algorithms, examining their performance and computational overhead. +Moreover, we provide an overview of applications developed using different PEFT +algorithms and discuss common techniques employed to mitigate computation costs +for PEFT. In addition to the algorithmic perspective, we overview various +real-world system designs to investigate the implementation costs associated +with different PEFT algorithms. This survey serves as an indispensable resource +for researchers aiming to understand both the PEFT algorithm and its system +implementation, offering detailed insights into recent advancements and +practical applications.",cs.LG,['cs.LG'] +HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation,Zhiying Leng · Tolga Birdal · Xiaohui Liang · Federico Tombari, ,https://arxiv.org/abs/2403.00372,,2403.00372.pdf,HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation,"3D shape generation from text is a fundamental task in 3D representation +learning. The text-shape pairs exhibit a hierarchical structure, where a +general text like ``chair"" covers all 3D shapes of the chair, while more +detailed prompts refer to more specific shapes. Furthermore, both text and 3D +shapes are inherently hierarchical structures. However, existing Text2Shape +methods, such as SDFusion, do not exploit that. In this work, we propose +HyperSDFusion, a dual-branch diffusion model that generates 3D shapes from a +given text. Since hyperbolic space is suitable for handling hierarchical data, +we propose to learn the hierarchical representations of text and 3D shapes in +hyperbolic space. 
First, we introduce a hyperbolic text-image encoder to learn +the sequential and multi-modal hierarchical features of text in hyperbolic +space. In addition, we design a hyperbolic text-graph convolution module to +learn the hierarchical features of text in hyperbolic space. In order to fully +utilize these text features, we introduce a dual-branch structure to embed text +features in 3D feature space. At last, to endow the generated 3D shapes with a +hierarchical structure, we devise a hyperbolic hierarchical loss. Our method is +the first to explore the hyperbolic hierarchical representation for +text-to-shape generation. Experimental results on the existing text-to-shape +paired dataset, Text2Shape, achieved state-of-the-art results. We release our +implementation under HyperSDFusion.github.io.",cs.CV,['cs.CV'] +Condition-Aware Neural Network for Controlled Image Generation,Han Cai · Muyang Li · Qinsheng Zhang · Ming-Yu Liu · Song Han, ,https://arxiv.org/abs/2404.01143,,2404.01143.pdf,Condition-Aware Neural Network for Controlled Image Generation,"We present Condition-Aware Neural Network (CAN), a new method for adding +control to image generative models. In parallel to prior conditional control +methods, CAN controls the image generation process by dynamically manipulating +the weight of the neural network. This is achieved by introducing a +condition-aware weight generation module that generates conditional weight for +convolution/linear layers based on the input condition. We test CAN on +class-conditional image generation on ImageNet and text-to-image generation on +COCO. CAN consistently delivers significant improvements for diffusion +transformer models, including DiT and UViT. In particular, CAN combined with +EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 +while requiring 52x fewer MACs per sampling step.",cs.CV,"['cs.CV', 'cs.AI']" +TULIP: Transformer for Upsampling of LiDAR Point Cloud,Bin Yang · Patrick Pfreundschuh · Roland Siegwart · Marco Hutter · Peyman Moghadam · Vaishakh Patil,https://github.com/ethz-asl/TULIP,https://arxiv.org/abs/2312.06733,,2312.06733.pdf,TULIP: Transformer for Upsampling of LiDAR Point Clouds,"LiDAR Upsampling is a challenging task for the perception systems of robots +and autonomous vehicles, due to the sparse and irregular structure of +large-scale scene contexts. Recent works propose to solve this problem by +converting LiDAR data from 3D Euclidean space into an image super-resolution +problem in 2D image space. Although their methods can generate high-resolution +range images with fine-grained details, the resulting 3D point clouds often +blur out details and predict invalid points. In this paper, we propose TULIP, a +new method to reconstruct high-resolution LiDAR point clouds from +low-resolution LiDAR input. We also follow a range image-based approach but +specifically modify the patch and window geometries of a Swin-Transformer-based +network to better fit the characteristics of range images. We conducted several +experiments on three public real-world and simulated datasets. 
TULIP +outperforms state-of-the-art methods in all relevant metrics and generates +robust and more realistic point clouds than prior works.",cs.CV,['cs.CV'] +PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving,Xinshuo Weng · Boris Ivanovic · Yan Wang · Yue Wang · Marco Pavone, ,https://arxiv.org/abs/2311.02077,,2311.02077.pdf,EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision,"We present EmerNeRF, a simple yet powerful approach for learning +spatial-temporal representations of dynamic driving scenes. Grounded in neural +fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, +and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: +First, it stratifies scenes into static and dynamic fields. This decomposition +emerges purely from self-supervision, enabling our model to learn from general, +in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field +from the dynamic field and uses this flow field to further aggregate +multi-frame features, amplifying the rendering precision of dynamic objects. +Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to +represent highly-dynamic scenes self-sufficiently, without relying on ground +truth object annotations or pre-trained models for dynamic object segmentation +or optical flow estimation. Our method achieves state-of-the-art performance in +sensor simulation, significantly outperforming previous methods when +reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In +addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual +foundation model features into 4D space-time and address a general positional +bias in modern Transformers, significantly boosting 3D perception performance +(e.g., 37.50% relative improvement in occupancy prediction accuracy on +average). Finally, we construct a diverse and challenging 120-sequence dataset +to benchmark neural fields under extreme and highly-dynamic settings.",cs.CV,['cs.CV'] +Driving Everywhere with Large Language Model Policy Adaptation,Boyi Li · Yue Wang · Jiageng Mao · Boris Ivanovic · Sushant Veer · Karen Leung · Marco Pavone, ,https://arxiv.org/abs/2402.05932,,2402.05932.pdf,Driving Everywhere with Large Language Model Policy Adaptation,"Adapting driving behavior to new environments, customs, and laws is a +long-standing problem in autonomous driving, precluding the widespread +deployment of autonomous vehicles (AVs). In this paper, we present LLaDA, a +simple yet powerful tool that enables human drivers and autonomous vehicles +alike to drive everywhere by adapting their tasks and motion plans to traffic +rules in new locations. LLaDA achieves this by leveraging the impressive +zero-shot generalizability of large language models (LLMs) in interpreting the +traffic rules in the local driver handbook. Through an extensive user study, we +show that LLaDA's instructions are useful in disambiguating in-the-wild +unexpected situations. We also demonstrate LLaDA's ability to adapt AV motion +planning policies in real-world datasets; LLaDA outperforms baseline planning +approaches on all our metrics. 
Please check our website for more details: +https://boyiliee.github.io/llada.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CL']" +Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer,Yuwen Tan · Qinhao Zhou · Xiang Xiang · Ke Wang · Yuchuan Wu · Yongbin Li, ,https://arxiv.org/abs/2403.19979,,2403.19979.pdf,Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer,"Class-incremental learning (CIL) aims to enable models to continuously learn +new classes while overcoming catastrophic forgetting. The introduction of +pre-trained models has brought new tuning paradigms to CIL. In this paper, we +revisit different parameter-efficient tuning (PET) methods within the context +of continual learning. We observe that adapter tuning demonstrates superiority +over prompt-based methods, even without parameter expansion in each learning +session. Motivated by this, we propose incrementally tuning the shared adapter +without imposing parameter update constraints, enhancing the learning capacity +of the backbone. Additionally, we employ feature sampling from stored +prototypes to retrain a unified classifier, further improving its performance. +We estimate the semantic shift of old prototypes without access to past samples +and update stored prototypes session by session. Our proposed method eliminates +model expansion and avoids retaining any image samples. It surpasses previous +pre-trained model-based CIL methods and demonstrates remarkable continual +learning capabilities. Experimental results on five CIL benchmarks validate the +effectiveness of our approach, achieving state-of-the-art (SOTA) performance.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks,Shangqian Gao · Yanfu Zhang · Feihu Huang · Heng Huang, ,https://arxiv.org/abs/2402.17862v1,,2402.17862v1.pdf,REPrune: Channel Pruning via Kernel Representative Selection,"Channel pruning is widely accepted to accelerate modern convolutional neural +networks (CNNs). The resulting pruned model benefits from its immediate +deployment on general-purpose software and hardware resources. However, its +large pruning granularity, specifically at the unit of a convolution filter, +often leads to undesirable accuracy drops due to the inflexibility of deciding +how and where to introduce sparsity to the CNNs. In this paper, we propose +REPrune, a novel channel pruning technique that emulates kernel pruning, fully +exploiting the finer but structured granularity. REPrune identifies similar +kernels within each channel using agglomerative clustering. Then, it selects +filters that maximize the incorporation of kernel representatives while +optimizing the maximum cluster coverage problem. By integrating with a +simultaneous training-pruning paradigm, REPrune promotes efficient, progressive +pruning throughout training CNNs, avoiding the conventional +train-prune-finetune sequence. 
Experimental results highlight that REPrune +performs better in computer vision tasks than existing methods, effectively +achieving a balance between acceleration ratio and performance retention.",cs.CV,"['cs.CV', 'cs.AI']" +DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing,Jia-Wei Liu · Yan-Pei Cao · Jay Zhangjie Wu · Weijia Mao · Yuchao Gu · Rui Zhao · Jussi Keppo · Ying Shan · Mike Zheng Shou, ,https://arxiv.org/abs/2310.10624,,2310.10624.pdf,DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing,"Despite recent progress in diffusion-based video editing, existing methods +are limited to short-length videos due to the contradiction between long-range +consistency and frame-wise editing. Prior attempts to address this challenge by +introducing video-2D representations encounter significant difficulties with +large-scale motion- and view-change videos, especially in human-centric +scenarios. To overcome this, we propose to introduce the dynamic Neural +Radiance Fields (NeRF) as the innovative video representation, where the +editing can be performed in the 3D spaces and propagated to the entire video +via the deformation field. To provide consistent and controllable editing, we +propose the image-based video-NeRF editing pipeline with a set of innovative +designs, including multi-view multi-pose Score Distillation Sampling (SDS) from +both the 2D personalized diffusion prior and 3D diffusion prior, reconstruction +losses, text-guided local parts super-resolution, and style transfer. Extensive +experiments demonstrate that our method, dubbed as DynVideo-E, significantly +outperforms SOTA approaches on two challenging datasets by a large margin of +50% ~ 95% for human preference. Code will be released at +https://showlab.github.io/DynVideo-E/.",cs.CV,['cs.CV'] +X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model,Lingmin Ran · Xiaodong Cun · Jia-Wei Liu · Rui Zhao · Song Zijie · Xintao Wang · Jussi Keppo · Mike Zheng Shou, ,https://arxiv.org/abs/2312.02238,,2312.02238.pdf,X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model,"We introduce X-Adapter, a universal upgrader to enable the pretrained +plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the +upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. +We achieve this goal by training an additional network to control the frozen +upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a +frozen copy of the old model to preserve the connectors of different plugins. +Additionally, X-Adapter adds trainable mapping layers that bridge the decoders +from models of different versions for feature remapping. The remapped features +will be used as guidance for the upgraded model. To enhance the guidance +ability of X-Adapter, we employ a null-text training strategy for the upgraded +model. After training, we also introduce a two-stage denoising strategy to +align the initial latents of X-Adapter and the upgraded model. Thanks to our +strategies, X-Adapter demonstrates universal compatibility with various plugins +and also enables plugins of different versions to work together, thereby +expanding the functionalities of diffusion community. 
To verify the +effectiveness of the proposed method, we conduct extensive experiments and the +results show that X-Adapter may facilitate wider application in the upgraded +foundational diffusion model.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model,Zhongcong Xu · Jianfeng Zhang · Jun Hao Liew · Hanshu Yan · Jia-Wei Liu · Chenxu Zhang · Jiashi Feng · Mike Zheng Shou, ,https://arxiv.org/abs/2311.16498,,2311.16498.pdf,MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model,"This paper studies the human image animation task, which aims to generate a +video of a certain reference identity following a particular motion sequence. +Existing animation works typically employ the frame-warping technique to +animate the reference image towards the target motion. Despite achieving +reasonable results, these approaches face challenges in maintaining temporal +consistency throughout the animation due to the lack of temporal modeling and +poor preservation of reference identity. In this work, we introduce +MagicAnimate, a diffusion-based framework that aims at enhancing temporal +consistency, preserving reference image faithfully, and improving animation +fidelity. To achieve this, we first develop a video diffusion model to encode +temporal information. Second, to maintain the appearance coherence across +frames, we introduce a novel appearance encoder to retain the intricate details +of the reference image. Leveraging these two innovations, we further employ a +simple video fusion technique to encourage smooth transitions for long video +animation. Empirical results demonstrate the superiority of our method over +baseline approaches on two benchmarks. Notably, our approach outperforms the +strongest baseline by over 38% in terms of video fidelity on the challenging +TikTok dancing dataset. Code and model will be made available.",cs.CV,"['cs.CV', 'cs.GR']" +VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,Yuchao Gu · Yipin Zhou · Bichen Wu · Licheng Yu · Jia-Wei Liu · Rui Zhao · Jay Zhangjie Wu · David Junhao Zhang · Mike Zheng Shou · Kevin Tang, ,https://arxiv.org/abs/2312.02087,,2312.02087.pdf,VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,"Current diffusion-based video editing primarily focuses on +structure-preserved editing by utilizing various dense correspondences to +ensure temporal consistency and motion alignment. However, these approaches are +often ineffective when the target edit involves a shape change. To embark on +video editing with shape change, we explore customized video subject swapping +in this work, where we aim to replace the main subject in a source video with a +target subject having a distinct identity and potentially different shape. In +contrast to previous methods that rely on dense correspondences, we introduce +the VideoSwap framework that exploits semantic point correspondences, inspired +by our observation that only a small number of semantic points are necessary to +align the subject's motion trajectory and modify its shape. We also introduce +various user-point interactions (\eg, removing points and dragging points) to +address various semantic point correspondence. 
Extensive experiments +demonstrate state-of-the-art video subject swapping results across a variety of +real-world videos.",cs.CV,['cs.CV'] +LIVE: Online Large Video-Language Model for Streaming Video,Joya Chen · Zhaoyang Lv · Shiwei Wu · Kevin Qinghong Lin · Chenan Song · Difei Gao · Jia-Wei Liu · Ziteng Gao · Dongxing Mao · Mike Zheng Shou, ,https://arxiv.org/abs/2405.16009,,2405.16009.pdf,Streaming Long Video Understanding with Large Language Models,"This paper presents VideoStreaming, an advanced vision-language large model +(VLLM) for video understanding, that capably understands arbitrary-length video +with a constant number of video tokens streamingly encoded and adaptively +selected. The challenge of video understanding in the vision language area +mainly lies in the significant computational burden caused by the great number +of tokens extracted from long videos. Previous works rely on sparse sampling or +frame compression to reduce tokens. However, such approaches either disregard +temporal information in a long time span or sacrifice spatial details, +resulting in flawed compression. To address these limitations, our +VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and +Adaptive Memory Selection. The Memory-Propagated Streaming Encoding +architecture segments long videos into short clips and sequentially encodes +each clip with a propagated memory. In each iteration, we utilize the encoded +results of the preceding clip as historical memory, which is integrated with +the current clip to distill a condensed representation that encapsulates the +video content up to the current timestamp. After the encoding process, the +Adaptive Memory Selection strategy selects a constant number of +question-related memories from all the historical memories and feeds them into +the LLM to generate informative responses. The question-related selection +reduces redundancy within the memories, enabling efficient and precise video +understanding. Meanwhile, the disentangled video extraction and reasoning +design allows the LLM to answer different questions about a video by directly +selecting corresponding memories, without the need to encode the whole video +for each question. Our model achieves superior performance and higher +efficiency on long video benchmarks, showcasing precise temporal comprehension +for detailed question answering.",cs.CV,['cs.CV'] +Restoration by Generation with Constrained Priors,Zheng Ding · Xuaner Zhang · Zhuowen Tu · Zhihao Xia,https://gen2res.github.io,https://arxiv.org/abs/2312.17161,,2312.17161.pdf,Restoration by Generation with Constrained Priors,"The inherent generative power of denoising diffusion models makes them +well-suited for image restoration tasks where the objective is to find the +optimal high-quality image within the generative space that closely resembles +the input image. We propose a method to adapt a pretrained diffusion model for +image restoration by simply adding noise to the input image to be restored and +then denoise. Our method is based on the observation that the space of a +generative model needs to be constrained. We impose this constraint by +finetuning the generative model with a set of anchor images that capture the +characteristics of the input image. With the constrained space, we can then +leverage the sampling strategy used for generation to do image restoration. 
We +evaluate against previous methods and show superior performances on multiple +real-world restoration datasets in preserving identity and image quality. We +also demonstrate an important and practical application on personalized +restoration, where we use a personal album as the anchor images to constrain +the generative space. This approach allows us to produce results that +accurately preserve high-frequency details, which previous works are unable to +do. Project webpage: https://gen2res.github.io.",cs.CV,['cs.CV'] +3D Multi-frame Fusion for Video Stabilization,Zhan Peng · Xinyi Ye · Weiyue Zhao · TIANQI LIU · Huiqiang Sun · Baopu Li · Zhiguo Cao, ,https://arxiv.org/abs/2404.12887,,2404.12887.pdf,3D Multi-frame Fusion for Video Stabilization,"In this paper, we present RStab, a novel framework for video stabilization +that integrates 3D multi-frame fusion through volume rendering. Departing from +conventional methods, we introduce a 3D multi-frame perspective to generate +stabilized images, addressing the challenge of full-frame generation while +preserving structure. The core of our approach lies in Stabilized Rendering +(SR), a volume rendering module, which extends beyond the image fusion by +incorporating feature fusion. The core of our RStab framework lies in +Stabilized Rendering (SR), a volume rendering module, fusing multi-frame +information in 3D space. Specifically, SR involves warping features and colors +from multiple frames by projection, fusing them into descriptors to render the +stabilized image. However, the precision of warped information depends on the +projection accuracy, a factor significantly influenced by dynamic regions. In +response, we introduce the Adaptive Ray Range (ARR) module to integrate depth +priors, adaptively defining the sampling range for the projection process. +Additionally, we propose Color Correction (CC) assisting geometric constraints +with optical flow for accurate color aggregation. Thanks to the three modules, +our RStab demonstrates superior performance compared with previous stabilizers +in the field of view (FOV), image quality, and video stability across various +datasets.",cs.CV,"['cs.CV', 'eess.IV']" +3D Facial Expressions through Analysis-by-Neural-Synthesis,George Retsinas · Panagiotis Filntisis · Radek Danecek · Victoria Abrevaya · Anastasios Roussos · Timo Bolkart · Petros Maragos,https://georgeretsi.github.io/smirk/,https://arxiv.org/abs/2404.04104,,2404.04104.pdf,3D Facial Expressions through Analysis-by-Neural-Synthesis,"While existing methods for 3D face reconstruction from in-the-wild images +excel at recovering the overall face shape, they commonly miss subtle, extreme, +asymmetric, or rarely observed expressions. We improve upon these methods with +SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which +faithfully reconstructs expressive 3D faces from images. We identify two key +limitations in existing methods: shortcomings in their self-supervised training +formulation, and a lack of expression diversity in the training images. For +training, most methods employ differentiable rendering to compare a predicted +face mesh with the input image, along with a plethora of additional loss +functions. This differentiable rendering loss not only has to provide +supervision to optimize for 3D face geometry, camera, albedo, and lighting, +which is an ill-posed optimization problem, but the domain gap between +rendering and input image further hinders the learning process. 
Instead, SMIRK +replaces the differentiable rendering with a neural rendering module that, +given the rendered predicted mesh geometry, and sparsely sampled pixels of the +input image, generates a face image. As the neural rendering gets color +information from sampled image pixels, supervising with neural rendering-based +reconstruction loss can focus solely on the geometry. Further, it enables us to +generate images of the input identity with varying expressions while training. +These are then utilized as input to the reconstruction model and used as +supervision with ground truth geometry. This effectively augments the training +data and enhances the generalization for diverse expressions. Our qualitative, +quantitative and particularly our perceptual evaluations demonstrate that SMIRK +achieves the new state-of-the art performance on accurate expression +reconstruction. Project webpage: https://georgeretsi.github.io/smirk/.",cs.CV,['cs.CV'] +SVGDreamer: Text Guided SVG Generation with Diffusion Model,XiMing Xing · Chuang Wang · Haitao Zhou · Jing Zhang · Dong Xu · Qian Yu,https://github.com/ximinng/SVGDreamer,https://arxiv.org/abs/2312.16476,,2312.16476.pdf,SVGDreamer: Text Guided SVG Generation with Diffusion Model,"Recently, text-guided scalable vector graphics (SVGs) synthesis has shown +promise in domains such as iconography and sketch. However, existing +text-to-SVG generation methods lack editability and struggle with visual +quality and result diversity. To address these limitations, we propose a novel +text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer +incorporates a semantic-driven image vectorization (SIVE) process that enables +the decomposition of synthesis into foreground objects and background, thereby +enhancing editability. Specifically, the SIVE process introduces +attention-based primitive control and an attention-mask loss function for +effective control and manipulation of individual elements. Additionally, we +propose a Vectorized Particle-based Score Distillation (VPSD) approach to +address issues of shape over-smoothing, color over-saturation, limited +diversity, and slow convergence of the existing text-to-SVG generation methods +by modeling SVGs as distributions of control points and colors. Furthermore, +VPSD leverages a reward model to re-weight vector particles, which improves +aesthetic appeal and accelerates convergence. Extensive experiments are +conducted to validate the effectiveness of SVGDreamer, demonstrating its +superiority over baseline methods in terms of editability, visual quality, and +diversity. Project page: +\href{https://ximinng.github.io/SVGDreamer-project/}{https://ximinng.github.io/SVGDreamer-project/}",cs.CV,"['cs.CV', 'cs.AI']" +ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations,Rwiddhi Chakraborty · Adrian de Sena Sletten · Michael C. Kampffmeyer, ,https://arxiv.org/abs/2403.13870,,2403.13870.pdf,ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations,"Group robustness strategies aim to mitigate learned biases in deep learning +models that arise from spurious correlations present in their training +datasets. However, most existing methods rely on the access to the label +distribution of the groups, which is time-consuming and expensive to obtain. As +a result, unsupervised group robustness strategies are sought. 
Based on the +insight that a trained model's classification strategies can be inferred +accurately based on explainability heatmaps, we introduce ExMap, an +unsupervised two stage mechanism designed to enhance group robustness in +traditional classifiers. ExMap utilizes a clustering module to infer +pseudo-labels based on a model's explainability heatmaps, which are then used +during training in lieu of actual labels. Our empirical studies validate the +efficacy of ExMap - We demonstrate that it bridges the performance gap with its +supervised counterparts and outperforms existing partially supervised and +unsupervised methods. Additionally, ExMap can be seamlessly integrated with +existing group robustness learning strategies. Finally, we demonstrate its +potential in tackling the emerging issue of multiple shortcut +mitigation\footnote{Code available at \url{https://github.com/rwchakra/exmap}}.",cs.CV,"['cs.CV', 'cs.LG']" +Learning Triangular Distribution in Visual World,Ping Chen · Xingpeng Zhang · Chengtao Zhou · dichao Fan · Peng Tu · Le Zhang · Yanlin Qian, ,https://arxiv.org/abs/2311.18605,,2311.18605.pdf,Learning Triangular Distribution in Visual World,"Convolution neural network is successful in pervasive vision tasks, including +label distribution learning, which usually takes the form of learning an +injection from the non-linear visual features to the well-defined labels. +However, how the discrepancy between features is mapped to the label +discrepancy is ambient, and its correctness is not guaranteed.To address these +problems, we study the mathematical connection between feature and its label, +presenting a general and simple framework for label distribution learning. We +propose a so-called Triangular Distribution Transform (TDT) to build an +injective function between feature and label, guaranteeing that any symmetric +feature discrepancy linearly reflects the difference between labels. The +proposed TDT can be used as a plug-in in mainstream backbone networks to +address different label distribution learning tasks. Experiments on Facial Age +Recognition, Illumination Chromaticity Estimation, and Aesthetics assessment +show that TDT achieves on-par or better results than the prior arts.",cs.CV,['cs.CV'] +ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing,Kartik Thakral · Shashikant Prasad · Stuti Aswani · Mayank Vatsa · Richa Singh, ,,https://github.com/Kartik-3004/facexformer,,,,,nan +Boosting Adversarial Training via Fisher-Rao Norm-based Regularization,Xiangyu Yin · Wenjie Ruan, ,https://arxiv.org/abs/2403.17520,,2403.17520.pdf,Boosting Adversarial Training via Fisher-Rao Norm-based Regularization,"Adversarial training is extensively utilized to improve the adversarial +robustness of deep neural networks. Yet, mitigating the degradation of standard +generalization performance in adversarial-trained models remains an open +problem. This paper attempts to resolve this issue through the lens of model +complexity. First, We leverage the Fisher-Rao norm, a geometrically invariant +metric for model complexity, to establish the non-trivial bounds of the +Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer +Perceptron. Then we generalize a complexity-related variable, which is +sensitive to the changes in model width and the trade-off factors in +adversarial training. 
Moreover, intensive empirical evidence validates that +this variable highly correlates with the generalization gap of Cross-Entropy +loss between adversarial-trained and standard-trained models, especially during +the initial and final phases of the training process. Building upon this +observation, we propose a novel regularization framework, called Logit-Oriented +Adversarial Training (LOAT), which can mitigate the trade-off between +robustness and accuracy while imposing only a negligible increase in +computational overhead. Our extensive experiments demonstrate that the proposed +regularization strategy can boost the performance of the prevalent adversarial +training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, +across various network architectures. Our code will be available at +https://github.com/TrustAI/LOAT.",cs.LG,"['cs.LG', 'cs.CV']" +CORES: Convolutional Response-based Score for Out-of-distribution Detection,Keke Tang · Chao Hou · Weilong Peng · Runnan Chen · Peican Zhu · Wenping Wang · Zhihong Tian, ,https://arxiv.org/abs/2405.01662,,2405.01662.pdf,Out-of-distribution detection based on subspace projection of high-dimensional features output by the last convolutional layer,"Out-of-distribution (OOD) detection, crucial for reliable pattern +classification, discerns whether a sample originates outside the training +distribution. This paper concentrates on the high-dimensional features output +by the final convolutional layer, which contain rich image features. Our key +idea is to project these high-dimensional features into two specific feature +subspaces, leveraging the dimensionality reduction capacity of the network's +linear layers, trained with Predefined Evenly-Distribution Class Centroids +(PEDCC)-Loss. This involves calculating the cosines of three projection angles +and the norm values of features, thereby identifying distinctive information +for in-distribution (ID) and OOD data, which assists in OOD detection. Building +upon this, we have modified the batch normalization (BN) and ReLU layer +preceding the fully connected layer, diminishing their impact on the output +feature distributions and thereby widening the distribution gap between ID and +OOD data features. Our method requires only the training of the classification +network model, eschewing any need for input pre-processing or specific OOD data +pre-tuning. Extensive experiments on several benchmark datasets demonstrates +that our approach delivers state-of-the-art performance. Our code is available +at https://github.com/Hewell0/ProjOOD.",cs.CV,['cs.CV'] +Higher-order Relational Reasoning for Pedestrian Trajectory Prediction,Sungjune Kim · Hyung-gun Chi · Hyerin Lim · Karthik Ramani · Jinkyu Kim · Sangpil Kim, ,https://arxiv.org/abs/2403.08032,,2403.08032.pdf,LG-Traj: LLM Guided Pedestrian Trajectory Prediction,"Accurate pedestrian trajectory prediction is crucial for various +applications, and it requires a deep understanding of pedestrian motion +patterns in dynamic environments. However, existing pedestrian trajectory +prediction methods still need more exploration to fully leverage these motion +patterns. This paper investigates the possibilities of using Large Language +Models (LLMs) to improve pedestrian trajectory prediction tasks by inducing +motion cues. We introduce LG-Traj, a novel approach incorporating LLMs to +generate motion cues present in pedestrian past/observed trajectories. 
Our +approach also incorporates motion cues present in pedestrian future +trajectories by clustering future trajectories of training data using a mixture +of Gaussians. These motion cues, along with pedestrian coordinates, facilitate +a better understanding of the underlying representation. Furthermore, we +utilize singular value decomposition to augment the observed trajectories, +incorporating them into the model learning process to further enhance +representation learning. Our method employs a transformer-based architecture +comprising a motion encoder to model motion patterns and a social decoder to +capture social interactions among pedestrians. We demonstrate the effectiveness +of our approach on popular pedestrian trajectory prediction benchmarks, namely +ETH-UCY and SDD, and present various ablation experiments to validate our +approach.",cs.CV,"['cs.CV', 'cs.AI']" +LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising,Yuxing Duan, ,https://arxiv.org/abs/2405.19718,,2405.19718.pdf,LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising,"Event camera has significant advantages in capturing dynamic scene +information while being prone to noise interference, particularly in +challenging conditions like low threshold and low illumination. However, most +existing research focuses on gentle situations, hindering event camera +applications in realistic complex scenarios. To tackle this limitation and +advance the field, we construct a new paired real-world event denoising dataset +(LED), including 3K sequences with 18K seconds of high-resolution (1200*680) +event streams and showing three notable distinctions compared to others: +diverse noise levels and scenes, larger-scale with high-resolution, and +high-quality GT. Specifically, it contains stepped parameters and varying +illumination with diverse scenarios. Moreover, based on the property of noise +events inconsistency and signal events consistency, we propose a novel +effective denoising framework(DED) using homogeneous dual events to generate +the GT with better separating noise from the raw. Furthermore, we design a +bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with +dynamic thresholds to realize accurate denoising. The experimental results +demonstrate that the remarkable performance of the proposed approach on +different datasets.The dataset and code are at https://github.com/Yee-Sing/led.",cs.CV,['cs.CV'] +Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds,Yujia Liu · Anton Obukhov · Jan D. Wegner · Konrad Schindler, ,https://arxiv.org/abs/2312.04962,,2312.04962.pdf,Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds,"Computer-Aided Design (CAD) model reconstruction from point clouds is an +important problem at the intersection of computer vision, graphics, and machine +learning; it saves the designer significant time when iterating on in-the-wild +objects. Recent advancements in this direction achieve relatively reliable +semantic segmentation but still struggle to produce an adequate topology of the +CAD model. In this work, we analyze the current state of the art for that +ill-posed task and identify shortcomings of existing methods. We propose a +hybrid analytic-neural reconstruction scheme that bridges the gap between +segmented point clouds and structured CAD models and can be readily combined +with different segmentation backbones. 
Moreover, to power the surface fitting +stage, we propose a novel implicit neural representation of freeform surfaces, +driving up the performance of our overall CAD reconstruction scheme. We +extensively evaluate our method on the popular ABC benchmark of CAD models and +set a new state-of-the-art for that dataset. Project page: +https://www.obukhov.ai/point2cad}{https://www.obukhov.ai/point2cad.",cs.CV,['cs.CV'] +Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation,Hoang Chuong Nguyen · Tianyu Wang · Jose M. Alvarez · Miaomiao Liu, ,https://arxiv.org/abs/2404.14908,,2404.14908.pdf,Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation,"This paper focuses on self-supervised monocular depth estimation in dynamic +scenes trained on monocular videos. Existing methods jointly estimate +pixel-wise depth and motion, relying mainly on an image reconstruction loss. +Dynamic regions1 remain a critical challenge for these methods due to the +inherent ambiguity in depth and motion estimation, resulting in inaccurate +depth estimation. This paper proposes a self-supervised training framework +exploiting pseudo depth labels for dynamic regions from training data. The key +contribution of our framework is to decouple depth estimation for static and +dynamic regions of images in the training data. We start with an unsupervised +depth estimation approach, which provides reliable depth estimates for static +regions and motion cues for dynamic regions and allows us to extract moving +object information at the instance level. In the next stage, we use an object +network to estimate the depth of those moving objects assuming rigid motions. +Then, we propose a new scale alignment module to address the scale ambiguity +between estimated depths for static and dynamic regions. We can then use the +depth labels generated to train an end-to-end depth estimation network and +improve its performance. Extensive experiments on the Cityscapes and KITTI +datasets show that our self-training strategy consistently outperforms existing +self/unsupervised depth estimation methods.",cs.CV,['cs.CV'] +Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline,Anas Al-lahham · Muhammad Zaigham Zaheer · Nurbek Tastan · Karthik Nandakumar,https://anasemad11.github.io/CLAP/,https://arxiv.org/abs/2404.00847,,2404.00847.pdf,Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline,"Unsupervised (US) video anomaly detection (VAD) in surveillance applications +is gaining more popularity recently due to its practical real-world +applications. As surveillance videos are privacy sensitive and the availability +of large-scale video data may enable better US-VAD systems, collaborative +learning can be highly rewarding in this setting. However, due to the extremely +challenging nature of the US-VAD task, where learning is carried out without +any annotations, privacy-preserving collaborative learning of US-VAD systems +has not been studied yet. In this paper, we propose a new baseline for anomaly +detection capable of localizing anomalous events in complex surveillance videos +in a fully unsupervised fashion without any labels on a privacy-preserving +participant-based distributed training configuration. Additionally, we propose +three new evaluation protocols to benchmark anomaly detection approaches on +various scenarios of collaborations and data availability. 
Based on these +protocols, we modify existing VAD datasets to extensively evaluate our approach +as well as existing US SOTA methods on two large-scale datasets including +UCF-Crime and XD-Violence. All proposed evaluation protocols, dataset splits, +and codes are available here: https://github.com/AnasEmad11/CLAP",cs.CV,['cs.CV'] +Unlocking Pretrained Image Backbones for Semantic Image Synthesis,Tariq Berrada · Jakob Verbeek · camille couprie · Karteek Alahari, ,https://arxiv.org/abs/2312.13314,,2312.13314.pdf,Unlocking Pre-trained Image Backbones for Semantic Image Synthesis,"Semantic image synthesis, i.e., generating images from user-provided semantic +label maps, is an important conditional image generation task as it allows to +control both the content as well as the spatial layout of generated images. +Although diffusion models have pushed the state of the art in generative image +modeling, the iterative nature of their inference process makes them +computationally demanding. Other approaches such as GANs are more efficient as +they only need a single feed-forward pass for generation, but the image quality +tends to suffer on large and diverse datasets. In this work, we propose a new +class of GAN discriminators for semantic image synthesis that generates highly +realistic images by exploiting feature backbone networks pre-trained for tasks +such as image classification. We also introduce a new generator architecture +with better context modeling and using cross-attention to inject noise into +latent variables, leading to more diverse generated images. Our model, which we +dub DP-SIMS, achieves state-of-the-art results in terms of image quality and +consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, +surpassing recent diffusion models while requiring two orders of magnitude less +compute for inference.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision,Yi Yu · Xue Yang · Qingyun Li · Feipeng Da · Jifeng Dai · Yu Qiao · Junchi Yan, ,https://arxiv.org/abs/2311.14758,,2311.14758.pdf,Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision,"With the rapidly increasing demand for oriented object detection (OOD), +recent research involving weakly-supervised detectors for learning rotated box +(RBox) from the horizontal box (HBox) has attracted more and more attention. In +this paper, we explore a more challenging yet label-efficient setting, namely +single point-supervised OOD, and present our approach called Point2RBox. +Specifically, we propose to leverage two principles: 1) Synthetic pattern +knowledge combination: By sampling around each labeled point on the image, we +spread the object feature to synthetic visual patterns with known boxes to +provide the knowledge for box regression. 2) Transform self-supervision: With a +transformed input image (e.g. scaled/rotated), the output RBoxes are trained to +follow the same transformation so that the network can perceive the relative +size/rotation between objects. The detector is further enhanced by a few +devised techniques to cope with peripheral issues, e.g. the anchor/layer +assignment as the size of the object is not available in our point supervision +setting. To our best knowledge, Point2RBox is the first end-to-end solution for +point-supervised OOD. 
In particular, our method uses a lightweight paradigm, +yet it achieves a competitive performance among point-supervised alternatives, +41.05%/27.62%/80.01% on DOTA/DIOR/HRSC datasets.",cs.CV,"['cs.CV', 'cs.AI']" +A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability,Xu Yang · Xuan chen · Moqi Li · Kun Wei · Cheng Deng, ,https://arxiv.org/abs/2405.14602,,2405.14602.pdf,Controllable Continual Test-Time Adaptation,"Continual Test-Time Adaptation (CTTA) is an emerging and challenging task +where a model trained in a source domain must adapt to continuously changing +conditions during testing, without access to the original source data. CTTA is +prone to error accumulation due to uncontrollable domain shifts, leading to +blurred decision boundaries between categories. Existing CTTA methods primarily +focus on suppressing domain shifts, which proves inadequate during the +unsupervised test phase. In contrast, we introduce a novel approach that guides +rather than suppresses these shifts. Specifically, we propose +$\textbf{C}$ontrollable $\textbf{Co}$ntinual $\textbf{T}$est-$\textbf{T}$ime +$\textbf{A}$daptation (C-CoTTA), which explicitly prevents any single category +from encroaching on others, thereby mitigating the mutual influence between +categories caused by uncontrollable shifts. Moreover, our method reduces the +sensitivity of model to domain transformations, thereby minimizing the +magnitude of category shifts. Extensive quantitative experiments demonstrate +the effectiveness of our method, while qualitative analyses, such as t-SNE +plots, confirm the theoretical validity of our approach.",cs.LG,['cs.LG'] +Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning,xin zhang · Jiawei Du · Weiying Xie · Yunsong Li · Joey Tianyi Zhou, ,https://arxiv.org/abs/2311.13613,,2311.13613.pdf,Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning,"Dataset pruning aims to construct a coreset capable of achieving performance +comparable to the original, full dataset. Most existing dataset pruning methods +rely on snapshot-based criteria to identify representative samples, often +resulting in poor generalization across various pruning and cross-architecture +scenarios. Recent studies have addressed this issue by expanding the scope of +training dynamics considered, including factors such as forgetting event and +probability change, typically using an averaging approach. However, these works +struggle to integrate a broader range of training dynamics without overlooking +well-generalized samples, which may not be sufficiently highlighted in an +averaging manner. In this study, we propose a novel dataset pruning method +termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS +utilizes a dual-depth strategy to achieve a balance between incorporating +extensive training dynamics and identifying representative samples for dataset +pruning. In the first depth, we estimate the series of each sample's individual +contributions spanning the training progress, ensuring comprehensive +integration of training dynamics. In the second depth, we focus on the +variability of the sample-wise contributions identified in the first depth to +highlight well-generalized samples. Extensive experiments conducted on CIFAR +and ImageNet datasets verify the superiority of TDDS over previous SOTA +methods. 
Specifically on CIFAR-100, our method achieves 54.51% accuracy with +only 10% training data, surpassing random selection by 7.83% and other +comparison methods by at least 12.69%.",cs.CV,"['cs.CV', 'cs.LG']" +FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models,Ao Luo · XIN LI · Fan Yang · Jiangyu Liu · Haoqiang Fan · Shuaicheng Liu, ,https://arxiv.org/html/2312.01746v1,,2312.01746v1.pdf,Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation,"Recently, Google proposes DDVM which for the first time demonstrates that a +general diffusion model for image-to-image translation task works impressively +well on optical flow estimation task without any specific designs like RAFT. +However, DDVM is still a closed-source model with the expensive and private +Palette-style pretraining. In this technical report, we present the first +open-source DDVM by reproducing it. We study several design choices and find +those important ones. By training on 40k public data with 4 GPUs, our +reproduction achieves comparable performance to the closed-source DDVM. The +code and model have been released in +https://github.com/DQiaole/FlowDiffusion_pytorch.",cs.CV,['cs.CV'] +Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing,Jan-Nico Zaech · Martin Danelljan · Tolga Birdal · Luc Van Gool, ,https://arxiv.org/abs/2310.12153,,2310.12153.pdf,Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing,"Adiabatic quantum computing (AQC) is a promising approach for discrete and +often NP-hard optimization problems. Current AQCs allow to implement problems +of research interest, which has sparked the development of quantum +representations for many computer vision tasks. Despite requiring multiple +measurements from the noisy AQC, current approaches only utilize the best +measurement, discarding information contained in the remaining ones. In this +work, we explore the potential of using this information for probabilistic +balanced k-means clustering. Instead of discarding non-optimal solutions, we +propose to use them to compute calibrated posterior probabilities with little +additional compute cost. This allows us to identify ambiguous solutions and +data points, which we demonstrate on a D-Wave AQC on synthetic tasks and real +visual data.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +OMG-Seg: Is One Model Good Enough For All Segmentation?,Xiangtai Li · Haobo Yuan · Wei Li · Henghui Ding · Size Wu · Wenwei Zhang · Yining Li · Kai Chen · Chen Change Loy, ,https://arxiv.org/abs/2401.10229,,2401.10229.pdf,OMG-Seg: Is One Model Good Enough For All Segmentation?,"In this work, we address various segmentation tasks, each traditionally +tackled by distinct or partially unified models. We propose OMG-Seg, One Model +that is Good enough to efficiently and effectively handle all the segmentation +tasks, including image semantic, instance, and panoptic segmentation, as well +as their video counterparts, open vocabulary settings, prompt-driven, +interactive segmentation like SAM, and video object segmentation. To our +knowledge, this is the first model to handle all these tasks in one model and +achieve satisfactory performance. We show that OMG-Seg, a transformer-based +encoder-decoder architecture with task-specific queries and outputs, can +support over ten distinct segmentation tasks and yet significantly reduce +computational and parameter overhead across various tasks and datasets. 
We +rigorously evaluate the inter-task influences and correlations during +co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.",cs.CV,['cs.CV'] +Towards Fairness-Aware Adversarial Learning,Yanghao Zhang · Tianle Zhang · Ronghui Mu · Xiaowei Huang · Wenjie Ruan, ,https://arxiv.org/abs/2402.17729,,2402.17729.pdf,Towards Fairness-Aware Adversarial Learning,"Although adversarial training (AT) has proven effective in enhancing the +model's robustness, the recently revealed issue of fairness in robustness has +not been well addressed, i.e. the robust accuracy varies significantly among +different categories. In this paper, instead of uniformly evaluating the +model's average class performance, we delve into the issue of robust fairness, +by considering the worst-case distribution across various classes. We propose a +novel learning paradigm, named Fairness-Aware Adversarial Learning (FAAL). As a +generalization of conventional AT, we re-define the problem of adversarial +training as a min-max-max framework, to ensure both robustness and fairness of +the trained model. Specifically, by taking advantage of distributional robust +optimization, our method aims to find the worst distribution among different +categories, and the solution is guaranteed to obtain the upper bound +performance with high probability. In particular, FAAL can fine-tune an unfair +robust model to be fair within only two epochs, without compromising the +overall clean and robust accuracies. Extensive experiments on various image +datasets validate the superior performance and efficiency of the proposed FAAL +compared to other state-of-the-art methods.",cs.CV,['cs.CV'] +Inter-X: Towards Versatile Human-Human Interaction Analysis,Liang Xu · Xintao Lv · Yichao Yan · Xin Jin · Wu Shuwen · Congsheng Xu · Yifan Liu · Yizhou Zhou · Fengyun Rao · Xingdong Sheng · Yunhui LIU · Wenjun Zeng · Xiaokang Yang, ,https://arxiv.org/abs/2312.16051,,2312.16051.pdf,Inter-X: Towards Versatile Human-Human Interaction Analysis,"The analysis of the ubiquitous human-human interactions is pivotal for +understanding humans as social beings. Existing human-human interaction +datasets typically suffer from inaccurate body motions, lack of hand gestures +and fine-grained textual descriptions. To better perceive and generate +human-human interactions, we propose Inter-X, a currently largest human-human +interaction dataset with accurate body movements and diverse interaction +patterns, together with detailed hand gestures. The dataset includes ~11K +interaction sequences and more than 8.1M frames. We also equip Inter-X with +versatile annotations of more than 34K fine-grained human part-level textual +descriptions, semantic interaction categories, interaction order, and the +relationship and personality of the subjects. Based on the elaborate +annotations, we propose a unified benchmark composed of 4 categories of +downstream tasks from both the perceptual and generative directions. Extensive +experiments and comprehensive analysis show that Inter-X serves as a testbed +for promoting the development of versatile human-human interaction analysis. 
+Our dataset and benchmark will be publicly available for research purposes.",cs.CV,['cs.CV'] +ReGenNet: Towards Human Action-Reaction Synthesis,Liang Xu · Yizhou Zhou · Yichao Yan · Xin Jin · Wenhan Zhu · Fengyun Rao · Xiaokang Yang · Wenjun Zeng, ,https://arxiv.org/abs/2403.11882,,2403.11882.pdf,ReGenNet: Towards Human Action-Reaction Synthesis,"Humans constantly interact with their surrounding environments. Current +human-centric generative models mainly focus on synthesizing humans plausibly +interacting with static scenes and objects, while the dynamic human +action-reaction synthesis for ubiquitous causal human-human interactions is +less explored. Human-human interactions can be regarded as asymmetric with +actors and reactors in atomic interaction periods. In this paper, we +comprehensively analyze the asymmetric, dynamic, synchronous, and detailed +nature of human-human interactions and propose the first multi-setting human +action-reaction synthesis benchmark to generate human reactions conditioned on +given human actions. To begin with, we propose to annotate the actor-reactor +order of the interaction sequences for the NTU120, InterHuman, and Chi3D +datasets. Based on them, a diffusion-based generative model with a Transformer +decoder architecture called ReGenNet together with an explicit distance-based +interaction loss is proposed to predict human reactions in an online manner, +where the future states of actors are unavailable to reactors. Quantitative and +qualitative results show that our method can generate instant and plausible +human reactions compared to the baselines, and can generalize to unseen actor +motions and viewpoint changes.",cs.CV,"['cs.CV', 'cs.AI']" +Universal Novelty Detection through Adaptive Contrastive Learning,Hossein Mirzaei · Mojtaba Nafez · Mohammad Jafari · Mohammad Soltani · Mohammad Azizmalayeri · Jafar Habibi · Mohammad Sabokrou · Mohammad Rohban, ,,https://oist.mlds.jp/2024/02/27/two-papers-have-been-accepted-by-cvpr-2024/,,,,,nan +Cross-dimension Affinity Distillation for 3D EM Neuron Segmentation,Xiaoyu Liu · Miaomiao Cai · Yinda Chen · Yueyi Zhang · Te Shi · Ruobing Zhang · Xuejin Chen · Zhiwei Xiong, ,https://arxiv.org/html/2401.03043v1,,2401.03043v1.pdf,Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing,"The current neuron reconstruction pipeline for electron microscopy (EM) data +usually includes automatic image segmentation followed by extensive human +expert proofreading. In this work, we aim to reduce human workload by +predicting connectivity between over-segmented neuron pieces, taking both +microscopy image and 3D morphology features into account, similar to human +proofreading workflow. To this end, we first construct a dataset, named +FlyTracing, that contains millions of pairwise connections of segments +expanding the whole fly brain, which is three orders of magnitude larger than +existing datasets for neuron segment connection. To learn sophisticated +biological imaging features from the connectivity annotations, we propose a +novel connectivity-aware contrastive learning method to generate dense +volumetric EM image embedding. The learned embeddings can be easily +incorporated with any point or voxel-based morphological representations for +automatic neuron tracing. 
Extensive comparisons of different combination +schemes of image and morphological representation in identifying split errors +across the whole fly brain demonstrate the superiority of the proposed +approach, especially for the locations that contain severe imaging artifacts, +such as section missing and misalignment. The dataset and code are available at +https://github.com/Levishery/Flywire-Neuron-Tracing.",cs.CV,['cs.CV'] +PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation,Ardian Umam · Cheng-Kun Yang · Min-Hung Chen · Jen-Hui Chuang · Yen-Yu Lin,https://ardianumam.github.io/partdistill/,https://arxiv.org/abs/2312.04016,,2312.04016.pdf,PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation,"This paper proposes a cross-modal distillation framework, PartDistill, which +transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D +shape part segmentation. PartDistill addresses three major challenges in this +task: the lack of 3D segmentation in invisible or undetected regions in the 2D +projections, inconsistent 2D predictions by VLMs, and the lack of knowledge +accumulation across different 3D shapes. PartDistill consists of a teacher +network that uses a VLM to make 2D predictions and a student network that +learns from the 2D predictions while extracting geometrical features from +multiple 3D shapes to carry out 3D part segmentation. A bi-directional +distillation, including forward and backward distillations, is carried out +within the framework, where the former forward distills the 2D predictions to +the student network, and the latter improves the quality of the 2D predictions, +which subsequently enhances the final 3D segmentation. Moreover, PartDistill +can exploit generative models that facilitate effortless 3D shape creation for +generating knowledge sources to be distilled. Through extensive experiments, +PartDistill boosts the existing methods with substantial margins on widely used +ShapeNetPart and PartNetE datasets, by more than 15% and 12% higher mIoU +scores, respectively. The code for this work is available at +https://github.com/ardianumam/PartDistill.",cs.CV,['cs.CV'] +Diffusion-FOF: Single-view Clothed Human Reconstruction via Diffusion-based Fourier Occupancy Field,Yuanzhen Li · Fei LUO · Chunxia Xiao,https://youtu.be/jm1CsLV_5XU,https://arxiv.org/abs/2311.15855,,,SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,"A long-standing goal of 3D human reconstruction is to create lifelike and +fully detailed 3D humans from single-view images. The main challenge lies in +inferring unknown body shapes, appearances, and clothing details in areas not +visible in the images. To address this, we propose SiTH, a novel pipeline that +uniquely integrates an image-conditioned diffusion model into a 3D mesh +reconstruction workflow. At the core of our method lies the decomposition of +the challenging single-view reconstruction problem into generative +hallucination and reconstruction subproblems. For the former, we employ a +powerful generative diffusion model to hallucinate unseen back-view appearance +based on the input images. For the latter, we leverage skinned body meshes as +guidance to recover full-body texture meshes from the input and back-view +images. SiTH requires as few as 500 3D human scans for training while +maintaining its generality and robustness to diverse images. 
Extensive
+evaluations on two 3D human benchmarks, including our newly created one,
+highlighted our method's superior accuracy and perceptual quality in 3D
+textured human reconstruction. Our code and evaluation benchmark are available
+at https://ait.ethz.ch/sith",cs.CV,['cs.CV']
+Distilling Semantic Priors from SAM to Efficient Image Restoration Models,Quan Zhang · Xiaoyu Liu · Wei Li · Hanting Chen · Junchao Liu · Jie Hu · Zhiwei Xiong · Chun Yuan · Yunhe Wang, ,https://arxiv.org/abs/2403.16368,,2403.16368.pdf,Distilling Semantic Priors from SAM to Efficient Image Restoration Models,"In image restoration (IR), leveraging semantic priors from segmentation
+models has been a common approach to improve performance. The recent segment
+anything model (SAM) has emerged as a powerful tool for extracting advanced
+semantic priors to enhance IR tasks. However, the computational cost of SAM is
+prohibitive for IR, compared to existing smaller IR models. The incorporation
+of SAM for extracting semantic priors considerably hampers the model inference
+efficiency. To address this issue, we propose a general framework to distill
+SAM's semantic knowledge to boost existing IR models without interfering with
+their inference process. Specifically, our proposed framework consists of the
+semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD)
+scheme. SPF fuses two kinds of information between the restored image predicted
+by the original IR model and the semantic mask predicted by SAM for the refined
+restored image. SPD leverages a self-distillation manner to distill the fused
+semantic priors to boost the performance of original IR models. Additionally,
+we design a semantic-guided relation (SGR) module for SPD, which ensures
+semantic feature representation space consistency to fully distill the priors.
+We demonstrate the effectiveness of our framework across multiple IR models and
+tasks, including deraining, deblurring, and denoising.",cs.CV,['cs.CV']
+Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features,Niladri Shekhar Dutt · Sanjeev Muralikrishnan · Niloy J. Mitra,https://diff3f.github.io/,https://arxiv.org/abs/2311.17024,,2311.17024.pdf,Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features,"We present Diff3F as a simple, robust, and class-agnostic feature descriptor
+that can be computed for untextured input shapes (meshes or point clouds). Our
+method distills diffusion features from image foundational models onto input
+shapes. Specifically, we use the input shapes to produce depth and normal maps
+as guidance for conditional image synthesis. In the process, we produce
+(diffusion) features in 2D that we subsequently lift and aggregate on the
+original surface. Our key observation is that even if the conditional image
+generations obtained from multi-view rendering of the input shapes are
+inconsistent, the associated image features are robust and, hence, can be
+directly aggregated across views. This produces semantic features on the input
+shapes, without requiring additional data or training. We perform extensive
+experiments on multiple benchmarks (SHREC'19, SHREC'20, FAUST, and TOSCA) and
+demonstrate that our features, being semantic instead of geometric, produce
+reliable correspondence across both isometric and non-isometrically related
+shape families.
Code is available via the project page at +https://diff3f.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations,Bo Sun · Thibault Groueix · Chen Song · Qixing Huang · Noam Aigerman, ,https://arxiv.org/abs/2307.09892,,2307.09892.pdf,3Deformer: A Common Framework for Image-Guided Mesh Deformation,"We propose 3Deformer, a general-purpose framework for interactive 3D shape +editing. Given a source 3D mesh with semantic materials, and a user-specified +semantic image, 3Deformer can accurately edit the source mesh following the +shape guidance of the semantic image, while preserving the source topology as +rigid as possible. Recent studies of 3D shape editing mostly focus on learning +neural networks to predict 3D shapes, which requires high-cost 3D training +datasets and is limited to handling objects involved in the datasets. Unlike +these studies, our 3Deformer is a non-training and common framework, which only +requires supervision of readily-available semantic images, and is compatible +with editing various objects unlimited by datasets. In 3Deformer, the source +mesh is deformed utilizing the differentiable renderer technique, according to +the correspondences between semantic images and mesh materials. However, +guiding complex 3D shapes with a simple 2D image incurs extra challenges, that +is, the deform accuracy, surface smoothness, geometric rigidity, and global +synchronization of the edited mesh should be guaranteed. To address these +challenges, we propose a hierarchical optimization architecture to balance the +global and local shape features, and propose further various strategies and +losses to improve properties of accuracy, smoothness, rigidity, and so on. +Extensive experiments show that our 3Deformer is able to produce impressive +results and reaches the state-of-the-art level.",cs.CV,['cs.CV'] +MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying,Ryan Burgert · Brian Price · Jason Kuen · Yijun Li · Michael Ryoo,https://ryanndagreat.github.io/MAGICK,https://arxiv.org/abs/2307.10350,,2307.10350.pdf,Improving Multimodal Datasets with Image Captioning,"Massive web datasets play a key role in the success of large vision-language +models like CLIP and Flamingo. However, the raw web data is noisy, and existing +filtering methods to reduce noise often come at the expense of data diversity. +Our work focuses on caption quality as one major source of noise, and studies +how generated captions can increase the utility of web-scraped datapoints with +nondescript text. Through exploring different mixing strategies for raw and +generated captions, we outperform the best filtering method proposed by the +DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a +candidate pool of 128M image-text pairs. Our best approach is also 2x better at +Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an +effective source of text supervision. In experimenting with different image +captioning models, we also demonstrate that the performance of a model on +standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable +indicator of the utility of the captions it generates for multimodal training. +Finally, our experiments with using generated captions at DataComp's large +scale (1.28B image-text pairs) offer insights into the limitations of synthetic +text, as well as the importance of image curation with increasing training data +quantity. 
The synthetic captions used in our experiments are now available on +HuggingFace.",cs.LG,"['cs.LG', 'cs.CV']" +Generative Latent Coding for Ultra-Low Bitrate Image Compression,Zhaoyang Jia · Jiahao Li · Bin Li · Houqiang Li · Yan Lu, ,https://arxiv.org/abs/2403.03736,,2403.03736.pdf,Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer,"Recent progress in generative compression technology has significantly +improved the perceptual quality of compressed data. However, these advancements +primarily focus on producing high-frequency details, often overlooking the +ability of generative models to capture the prior distribution of image +content, thus impeding further bitrate reduction in extreme compression +scenarios (<0.05 bpp). Motivated by the capabilities of predictive language +models for lossless compression, this paper introduces a novel Unified Image +Generation-Compression (UIGC) paradigm, merging the processes of generation and +compression. A key feature of the UIGC framework is the adoption of +vector-quantized (VQ) image models for tokenization, alongside a multi-stage +transformer designed to exploit spatial contextual information for modeling the +prior distribution. As such, the dual-purpose framework effectively utilizes +the learned prior for entropy estimation and assists in the regeneration of +lost tokens. Extensive experiments demonstrate the superiority of the proposed +UIGC framework over existing codecs in perceptual quality and human perception, +particularly in ultra-low bitrate scenarios (<=0.03 bpp), pioneering a new +direction in generative compression.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Makeup Prior Models for 3D Facial Makeup Estimation and Applications,Xingchao Yang · Takafumi Taketomi · Yuki Endo · Yoshihiro Kanamori,https://yangxingchao.github.io/makeup-priors-page/,https://arxiv.org/abs/2403.17761,,2403.17761.pdf,Makeup Prior Models for 3D Facial Makeup Estimation and Applications,"In this work, we introduce two types of makeup prior models to extend +existing 3D face prior models: PCA-based and StyleGAN2-based priors. The +PCA-based prior model is a linear model that is easy to construct and is +computationally efficient. However, it retains only low-frequency information. +Conversely, the StyleGAN2-based model can represent high-frequency information +with relatively higher computational cost than the PCA-based model. Although +there is a trade-off between the two models, both are applicable to 3D facial +makeup estimation and related applications. By leveraging makeup prior models +and designing a makeup consistency module, we effectively address the +challenges that previous methods faced in robustly estimating makeup, +particularly in the context of handling self-occluded faces. In experiments, we +demonstrate that our approach reduces computational costs by several orders of +magnitude, achieving speeds up to 180 times faster. 
In addition, by improving +the accuracy of the estimated makeup, we confirm that our methods are highly +advantageous for various 3D facial makeup applications such as 3D makeup face +reconstruction, user-friendly makeup editing, makeup transfer, and +interpolation.",cs.CV,"['cs.CV', 'cs.GR']" +Asymmetric Masked Distillation for Pre-Training Small Foundation Models,Zhiyu Zhao · Bingkun Huang · Sen Xing · Gangshan Wu · Yu Qiao · Limin Wang, ,https://arxiv.org/abs/2311.03149,,,Asymmetric Masked Distillation for Pre-Training Small Foundation Models,"Self-supervised foundation models have shown great potential in computer +vision thanks to the pre-training paradigm of masked autoencoding. Scale is a +primary factor influencing the performance of these foundation models. However, +these large foundation models often result in high computational cost. This +paper focuses on pre-training relatively small vision transformer models that +could be efficiently adapted to downstream tasks. Specifically, taking +inspiration from knowledge distillation in model compression, we propose a new +asymmetric masked distillation (AMD) framework for pre-training relatively +small models with autoencoding. The core of AMD is to devise an asymmetric +masking strategy, where the teacher model is enabled to see more context +information with a lower masking ratio, while the student model is still +equipped with a high masking ratio. We design customized multi-layer feature +alignment between the teacher encoder and student encoder to regularize the +pre-training of student MAE. To demonstrate the effectiveness and versatility +of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively +small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the +ViT-B model. And AMD achieves 73.3% classification accuracy using the ViT-B +model on the Something-in-Something V2 dataset, a 3.7% improvement over the +original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to +downstream tasks and obtain consistent performance improvement over the +original masked autoencoding. The code and models are available at +https://github.com/MCG-NJU/AMD.",cs.CV,['cs.CV'] +Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer,Yuang Ai · Xiaoqiang Zhou · Huaibo Huang · Lei Zhang · Ran He, ,https://arxiv.org/abs/2404.11273,,2404.11273.pdf,Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution,"Transformer-based models have achieved remarkable results in low-level vision +tasks including image super-resolution (SR). However, early Transformer-based +approaches that rely on self-attention within non-overlapping windows encounter +challenges in acquiring global information. To activate more input pixels +globally, hybrid attention models have been proposed. Moreover, training by +solely minimizing pixel-wise RGB losses, such as L1, have been found inadequate +for capturing essential high-frequency details. This paper presents two +contributions: i) We introduce convolutional non-local sparse attention (NLSA) +blocks to extend the hybrid transformer architecture in order to further +enhance its receptive field. ii) We employ wavelet losses to train Transformer +models to improve quantitative and subjective performance. While wavelet losses +have been explored previously, showing their power in training +Transformer-based SR models is novel. 
Our experimental results demonstrate that +the proposed model provides state-of-the-art PSNR results as well as superior +visual performance across various benchmark datasets.",eess.IV,"['eess.IV', 'cs.CV']" +AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval,Sixing Yan · William K. Cheung · Ivor Tsang · Wan Hang Keith Chiu · Tong Terence · Ka Chun Cheung · Simon See, ,,https://www.a-star.edu.sg/cfar/research/publications,,,,,nan +Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation,Fuli Wan · Han Zhao · Xu Yang · Cheng Deng, ,https://arxiv.org/abs/2312.03767,,2312.03767.pdf,Unknown Sample Discovery for Source Free Open Set Domain Adaptation,"Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source +domain to a target domain that undergoes distribution shift and contains +samples from novel classes outside the source domain. Source-free OSDA +(SF-OSDA) techniques eliminate the need to access source domain samples, but +current SF-OSDA methods utilize only the known classes in the target domain for +adaptation, and require access to the entire target domain even during +inference after adaptation, to make the distinction between known and unknown +samples. In this paper, we introduce Unknown Sample Discovery (USD) as an +SF-OSDA method that utilizes a temporally ensembled teacher model to conduct +known-unknown target sample separation and adapts the student model to the +target domain over all classes using co-training and temporal consistency +between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) +as an effective measure for known-unknown sample separation. Our +teacher-student framework significantly reduces error accumulation resulting +from imperfect known-unknown sample separation, while curriculum guidance helps +to reliably learn the distinction between target known and target unknown +subspaces. USD appends the target model with an unknown class node, thus +readily classifying a target sample into any of the known or unknown classes in +subsequent post-adaptation inference stages. Empirical results show that USD is +superior to existing SF-OSDA methods and is competitive with current OSDA +models that utilize both source and target domains during adaptation.",cs.CV,"['cs.CV', 'cs.AI']" +Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,Jiequan Cui · Beier Zhu · Xin Wen · Xiaojuan Qi · Bei Yu · Hanwang Zhang, ,https://arxiv.org/abs/2402.18133,,2402.18133.pdf,Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,"In this paper, we present an empirical study on image recognition fairness, +i.e., extreme class accuracy disparity on balanced data like ImageNet. We +experimentally demonstrate that classes are not equal and the fairness issue is +prevalent for image classification models across various datasets, network +architectures, and model capacities. Moreover, several intriguing properties of +fairness are identified. First, the unfairness lies in problematic +representation rather than classifier bias. Second, with the proposed concept +of Model Prediction Bias, we investigate the origins of problematic +representation during optimization. Our findings reveal that models tend to +exhibit greater prediction biases for classes that are more challenging to +recognize. It means that more other classes will be confused with harder +classes. 
Then the False Positives (FPs) will dominate the learning in +optimization, thus leading to their poor accuracy. Further, we conclude that +data augmentation and representation learning algorithms improve overall +performance by promoting fairness to some degree in image classification. The +Code is available at +https://github.com/dvlab-research/Parametric-Contrastive-Learning.",cs.LG,"['cs.LG', 'cs.CV']" +The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding,Lorenzo Bianchi · Fabio Carrara · Nicola Messina · Claudio Gennaro · Fabrizio Falchi,https://lorebianchi98.github.io/FG-OVD/,https://arxiv.org/abs/2311.17518v2,,2311.17518v2.pdf,The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding,"Recent advancements in large vision-language models enabled visual object +detection in open-vocabulary scenarios, where object classes are defined in +free-text formats during inference. In this paper, we aim to probe the +state-of-the-art methods for open-vocabulary object detection to determine to +what extent they understand fine-grained properties of objects and their parts. +To this end, we introduce an evaluation protocol based on dynamic vocabulary +generation to test whether models detect, discern, and assign the correct +fine-grained description to objects in the presence of hard-negative classes. +We contribute with a benchmark suite of increasing difficulty and probing +different properties like color, pattern, and material. We further enhance our +investigation by evaluating several state-of-the-art open-vocabulary object +detectors using the proposed protocol and find that most existing solutions, +which shine in standard open-vocabulary benchmarks, struggle to accurately +capture and distinguish finer object details. We conclude the paper by +highlighting the limitations of current methodologies and exploring promising +research directions to overcome the discovered drawbacks. Data and code are +available at https://lorebianchi98.github.io/FG-OVD/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Style Aligned Image Generation via Shared Attention,Amir Hertz · Andrey Voynov · Shlomi Fruchter · Daniel Cohen-Or,https://style-aligned-gen.github.io/,https://arxiv.org/abs/2312.02133v1,,2312.02133v1.pdf,Style Aligned Image Generation via Shared Attention,"Large-scale Text-to-Image (T2I) models have rapidly gained prominence across +creative fields, generating visually compelling outputs from textual prompts. +However, controlling these models to ensure consistent style remains +challenging, with existing methods necessitating fine-tuning and manual +intervention to disentangle content and style. In this paper, we introduce +StyleAligned, a novel technique designed to establish style alignment among a +series of generated images. By employing minimal `attention sharing' during the +diffusion process, our method maintains style consistency across images within +T2I models. This approach allows for the creation of style-consistent images +using a reference style through a straightforward inversion operation. 
Our +method's evaluation across diverse styles and text prompts demonstrates +high-quality synthesis and fidelity, underscoring its efficacy in achieving +consistent style across various inputs.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +"Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration",Yuang Ai · Huaibo Huang · Xiaoqiang Zhou · Jiexiang Wang · Ran He, ,https://arxiv.org/abs/2312.02918v2,,2312.02918v2.pdf,"Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration","Despite substantial progress, all-in-one image restoration (IR) grapples with +persistent challenges in handling intricate real-world degradations. This paper +introduces MPerceiver: a novel multimodal prompt learning approach that +harnesses Stable Diffusion (SD) priors to enhance adaptiveness, +generalizability and fidelity for all-in-one image restoration. Specifically, +we develop a dual-branch module to master two types of SD prompts: textual for +holistic representation and visual for multiscale detail representation. Both +prompts are dynamically adjusted by degradation predictions from the CLIP image +encoder, enabling adaptive responses to diverse unknown degradations. Moreover, +a plug-in detail refinement module improves restoration fidelity via direct +encoder-to-decoder information transformation. To assess our method, MPerceiver +is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art +task-specific methods across most tasks. Post multitask pre-training, +MPerceiver attains a generalized representation in low-level vision, exhibiting +remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive +experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of +adaptiveness, generalizability and fidelity.",cs.CV,['cs.CV'] +FMA-Net: Flow Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring,Geunhyuk Youk · Jihyong Oh · Munchurl Kim,https://kaist-viclab.github.io/fmanet-site,https://arxiv.org/abs/2401.03707,,2401.03707.pdf,FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring,"We present a joint learning scheme of video super-resolution and deblurring, +called VSRDB, to restore clean high-resolution (HR) videos from blurry +low-resolution (LR) ones. This joint restoration problem has drawn much less +attention compared to single restoration problems. In this paper, we propose a +novel flow-guided dynamic filtering (FGDF) and iterative feature refinement +with multi-attention (FRMA), which constitutes our VSRDB framework, denoted as +FMA-Net. Specifically, our proposed FGDF enables precise estimation of both +spatio-temporally-variant degradation and restoration kernels that are aware of +motion trajectories through sophisticated motion representation learning. +Compared to conventional dynamic filtering, the FGDF enables the FMA-Net to +effectively handle large motions into the VSRDB. Additionally, the stacked FRMA +blocks trained with our novel temporal anchor (TA) loss, which temporally +anchors and sharpens features, refine features in a course-to-fine manner +through iterative updates. Extensive experiments demonstrate the superiority of +the proposed FMA-Net over state-of-the-art methods in terms of both +quantitative and qualitative quality. 
Codes and pre-trained models are +available at: https://kaist-viclab.github.io/fmanet-site",cs.CV,['cs.CV'] +Device-Wise Federated Network Pruning,Shangqian Gao · Junyi Li · Zeyu Zhang · Yanfu Zhang · Weidong Cai · Heng Huang, ,,https://lijunyi95.github.io/publications/,,,,,nan +Differentiable Display Photometric Stereo,Seokjun Choi · Seungwoo Yoon · Giljoo Nam · Seungyong Lee · Seung-Hwan Baek, ,https://arxiv.org/abs/2306.13325,,2306.13325.pdf,Differentiable Display Photometric Stereo,"Photometric stereo leverages variations in illumination conditions to +reconstruct surface normals. Display photometric stereo, which employs a +conventional monitor as an illumination source, has the potential to overcome +limitations often encountered in bulky and difficult-to-use conventional +setups. In this paper, we present differentiable display photometric stereo +(DDPS), addressing an often overlooked challenge in display photometric stereo: +the design of display patterns. Departing from using heuristic display +patterns, DDPS learns the display patterns that yield accurate normal +reconstruction for a target system in an end-to-end manner. To this end, we +propose a differentiable framework that couples basis-illumination image +formation with analytic photometric-stereo reconstruction. The differentiable +framework facilitates the effective learning of display patterns via +auto-differentiation. Also, for training supervision, we propose to use 3D +printing for creating a real-world training dataset, enabling accurate +reconstruction on the target real-world setup. Finally, we exploit that +conventional LCD monitors emit polarized light, which allows for the optical +separation of diffuse and specular reflections when combined with a +polarization camera, leading to accurate normal reconstruction. Extensive +evaluation of DDPS shows improved normal-reconstruction accuracy compared to +heuristic patterns and demonstrates compelling properties such as robustness to +pattern initialization, calibration errors, and simplifications in image +formation and reconstruction.",cs.CV,['cs.CV'] +"Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction",Yizhi Wang · Wallace Lira · Wenqi Wang · Ali Mahdavi Amiri · Hao Zhang,https://yizhiwang96.github.io/Slice3D/,https://arxiv.org/abs/2312.02221,,2312.02221.pdf,"Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction","We introduce multi-slice reasoning, a new notion for single-view 3D +reconstruction which challenges the current and prevailing belief that +multi-view synthesis is the most natural conduit between single-view and 3D. +Our key observation is that object slicing is more advantageous than altering +views to reveal occluded structures. Specifically, slicing is more +occlusion-revealing since it can peel through any occluders without +obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed +to unveil all hidden object parts. We realize our idea by developing Slice3D, a +novel method for single-view 3D reconstruction which first predicts multi-slice +images from a single RGB image and then integrates the slices into a 3D model +using a coordinate-based transformer network for signed distance prediction. +The slice images can be regressed or generated, both through a U-Net based +network. 
For the former, we inject a learnable slice indicator code to +designate each decoded image into a spatial slice location, while the slice +generator is a denoising diffusion model operating on the entirety of slice +images stacked on the input channels. We conduct extensive evaluation against +state-of-the-art alternatives to demonstrate superiority of our method, +especially in recovering complex and severely occluded shape structures, amid +ambiguities. All Slice3D results were produced by networks trained on a single +Nvidia A40 GPU, with an inference time less than 20 seconds.",cs.CV,"['cs.CV', 'cs.GR']" +Cyclic Learning for Binaural Audio Generation and Localization,Zhaojian Li · Bin Zhao · Yuan Yuan, ,https://arxiv.org/abs/2311.07630,,2311.07630.pdf,Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation,"Binaural stereo audio is recorded by imitating the way the human ear receives +sound, which provides people with an immersive listening experience. Existing +approaches leverage autoencoders and directly exploit visual spatial +information to synthesize binaural stereo, resulting in a limited +representation of visual guidance. For the first time, we propose a visually +guided generative adversarial approach for generating binaural stereo audio +from mono audio. Specifically, we develop a Stereo Audio Generation Model +(SAGM), which utilizes shared spatio-temporal visual information to guide the +generator and the discriminator to work separately. The shared visual +information is updated alternately in the generative adversarial stage, +allowing the generator and discriminator to deliver their respective guided +knowledge while visually sharing. The proposed method learns bidirectional +complementary visual information, which facilitates the expression of visual +guidance in generation. In addition, spatial perception is a crucial attribute +of binaural stereo audio, and thus the evaluation of stereo spatial perception +is essential. However, previous metrics failed to measure the spatial +perception of audio. To this end, a metric to measure the spatial perception of +audio is proposed for the first time. The proposed metric is capable of +measuring the magnitude and direction of spatial perception in the temporal +dimension. Further, considering its function, it is feasible to utilize it +instead of demanding user studies to some extent. The proposed method achieves +state-of-the-art performance on 2 datasets and 5 evaluation metrics. +Qualitative experiments and user studies demonstrate that the method generates +space-realistic stereo audio.",cs.SD,"['cs.SD', 'cs.CV', 'cs.LG', 'eess.AS']" +OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition,Tongjia Chen · Hongshan Yu · Zhengeng Yang · Zechuan Li · Wei Sun · Chen Chen,https://tomchen-ctj.github.io/OST/,https://arxiv.org/abs/2312.00096,,2312.00096.pdf,OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition,"Due to the resource-intensive nature of training vision-language models on +expansive video data, a majority of studies have centered on adapting +pre-trained image-language models to the video domain. Dominant pipelines +propose to tackle the visual discrepancies with additional temporal learners +while overlooking the substantial discrepancy for web-scaled descriptive +narratives and concise action category names, leading to less distinct semantic +space and potential performance limitations. 
In this work, we prioritize the +refinement of text knowledge to facilitate generalizable video recognition. To +address the limitations of the less distinct semantic space of category names, +we prompt a large language model (LLM) to augment action class names into +Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving +as a knowledge base for general recognition. Moreover, to assign the best +descriptors with different video instances, we propose Optimal Descriptor +Solver, forming the video recognition problem as solving the optimal matching +flow across frame-level representations and descriptors. Comprehensive +evaluations in zero-shot, few-shot, and fully supervised video recognition +highlight the effectiveness of our approach. Our best model achieves a +state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.",cs.CV,['cs.CV'] +Visual Objectification in Films: Towards a New AI Task for Video Interpretation,Julie Tores · Lucile Sassatelli · Hui-Yin Wu · Clement Bergman · Léa Andolfi · Victor Ecrement · Frederic Precioso · Thierry Devars · Magali GUARESI · Virginie Julliard · Sarah Lécossais, ,https://arxiv.org/abs/2401.13296,,2401.13296.pdf,Visual Objectification in Films: Towards a New AI Task for Video Interpretation,"In film gender studies, the concept of 'male gaze' refers to the way the +characters are portrayed on-screen as objects of desire rather than subjects. +In this article, we introduce a novel video-interpretation task, to detect +character objectification in films. The purpose is to reveal and quantify the +usage of complex temporal patterns operated in cinema to produce the cognitive +perception of objectification. We introduce the ObyGaze12 dataset, made of 1914 +movie clips densely annotated by experts for objectification concepts +identified in film studies and psychology. We evaluate recent vision models, +show the feasibility of the task and where the challenges remain with concept +bottleneck models. Our new dataset and code are made available to the +community.",cs.CV,['cs.CV'] +Bilateral Event Mining and Complementary for Event Stream Super-Resolution,Zhilin Huang · Quanmin Liang · Yijie Yu · Chujun Qin · Xiawu Zheng · Kai Huang · Zikun Zhou · Wenming Yang, ,https://arxiv.org/abs/2405.10037v1,,2405.10037v1.pdf,Bilateral Event Mining and Complementary for Event Stream Super-Resolution,"Event Stream Super-Resolution (ESR) aims to address the challenge of +insufficient spatial resolution in event streams, which holds great +significance for the application of event cameras in complex scenarios. +Previous works for ESR often process positive and negative events in a mixed +paradigm. This paradigm limits their ability to effectively model the unique +characteristics of each event and mutually refine each other by considering +their correlations. In this paper, we propose a bilateral event mining and +complementary network (BMCNet) to fully leverage the potential of each event +and capture the shared information to complement each other simultaneously. +Specifically, we resort to a two-stream network to accomplish comprehensive +mining of each type of events individually. To facilitate the exchange of +information between two streams, we propose a bilateral information exchange +(BIE) module. This module is layer-wisely embedded between two streams, +enabling the effective propagation of hierarchical global information while +alleviating the impact of invalid information brought by inherent +characteristics of events. 
The experimental results demonstrate that our
+approach outperforms the previous state-of-the-art methods in ESR, achieving
+performance improvements of over 11% on both real and synthetic datasets.
+Moreover, our method significantly enhances the performance of event-based
+downstream tasks such as object recognition and video reconstruction. Our code
+is available at https://github.com/Lqm26/BMCNet-ESR.",cs.CV,['cs.CV']
+Instance-Aware Group Quantization for Vision Transformers,Jaehyeon Moon · Dohyung Kim · Jun Yong Cheon · Bumsub Ham,https://cvlab.yonsei.ac.kr/projects/IGQ-ViT/,https://arxiv.org/abs/2404.00928,,2404.00928.pdf,Instance-Aware Group Quantization for Vision Transformers,"Post-training quantization (PTQ) is an efficient model compression technique
+that quantizes a pretrained full-precision model using only a small calibration
+set of unlabeled samples without retraining. PTQ methods for convolutional
+neural networks (CNNs) provide quantization results comparable to
+full-precision counterparts. Directly applying them to vision transformers
+(ViTs), however, incurs severe performance degradation, mainly due to the
+differences in architectures between CNNs and ViTs. In particular, the
+distribution of activations for each channel vary drastically according to
+input instances, making PTQ methods for CNNs inappropriate for ViTs. To address
+this, we introduce instance-aware group quantization for ViTs (IGQ-ViT). To
+this end, we propose to split the channels of activation maps into multiple
+groups dynamically for each input instance, such that activations within each
+group share similar statistical properties. We also extend our scheme to
+quantize softmax attentions across tokens. In addition, the number of groups
+for each layer is adjusted to minimize the discrepancies between predictions
+from quantized and full-precision models, under a bit-operation (BOP)
+constraint. We show extensive experimental results on image classification,
+object detection, and instance segmentation, with various transformer
+architectures, demonstrating the effectiveness of our approach.",cs.CV,"['cs.CV', 'cs.LG']"
+MLP Can Be A Good Transformer Learner,Sihao Lin · Pumeng Lyu · Dongrui Liu · Tao Tang · Xiaodan Liang · Andy Song · Xiaojun Chang, ,https://arxiv.org/abs/2404.05657,,2404.05657.pdf,MLP Can Be A Good Transformer Learner,"Self-attention mechanism is the key of the Transformer but often criticized
+for its computation demands. Previous token pruning works motivate their
+methods from the view of computation redundancy but still need to load the full
+network and require same memory costs. This paper introduces a novel strategy
+that simplifies vision transformers and reduces computational load through the
+selective removal of non-essential attention layers, guided by entropy
+considerations. We identify that regarding the attention layer in bottom
+blocks, their subsequent MLP layers, i.e. two feed-forward layers, can elicit
+the same entropy quantity. Meanwhile, the accompanied MLPs are under-exploited
+since they exhibit smaller feature entropy compared to those MLPs in the top
+blocks. Therefore, we propose to integrate the uninformative attention layers
+into their subsequent counterparts by degenerating them into identical mapping,
+yielding only MLP in certain transformer blocks. Experimental results on
+ImageNet-1k show that the proposed method can remove 40% attention layer of
+DeiT-B, improving throughput and memory bound without performance compromise.
+Code is available at https://github.com/sihaoevery/lambda_vit.",cs.CV,['cs.CV'] +BiPer: Binary Neural Networks using a Periodic Function,Edwin Vargas · Claudia Correa · Carlos Hinojosa · Henry Arguello, ,https://arxiv.org/abs/2404.01278,,2404.01278.pdf,BiPer: Binary Neural Networks using a Periodic Function,"Quantized neural networks employ reduced precision representations for both +weights and activations. This quantization process significantly reduces the +memory requirements and computational complexity of the network. Binary Neural +Networks (BNNs) are the extreme quantization case, representing values with +just one bit. Since the sign function is typically used to map real values to +binary values, smooth approximations are introduced to mimic the gradients +during error backpropagation. Thus, the mismatch between the forward and +backward models corrupts the direction of the gradient, causing training +inconsistency problems and performance degradation. In contrast to current BNN +approaches, we propose to employ a binary periodic (BiPer) function during +binarization. Specifically, we use a square wave for the forward pass to obtain +the binary values and employ the trigonometric sine function with the same +period of the square wave as a differentiable surrogate during the backward +pass. We demonstrate that this approach can control the quantization error by +using the frequency of the periodic function and improves network performance. +Extensive experiments validate the effectiveness of BiPer in benchmark datasets +and network architectures, with improvements of up to 1% and 0.69% with respect +to state-of-the-art methods in the classification task over CIFAR-10 and +ImageNet, respectively. Our code is publicly available at +https://github.com/edmav4/BiPer.",cs.CV,['cs.CV'] +Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations,Daan de Geus · Gijs Dubbelman,https://www.tue-mps.org/tapps/,https://arxiv.org/abs/2311.18618,,2311.18618.pdf,JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation,"Part-aware panoptic segmentation is a problem of computer vision that aims to +provide a semantic understanding of the scene at multiple levels of +granularity. More precisely, semantic areas, object instances, and semantic +parts are predicted simultaneously. In this paper, we present our Joint +Panoptic Part Fusion (JPPF) that combines the three individual segmentations +effectively to obtain a panoptic-part segmentation. Two aspects are of utmost +importance for this: First, a unified model for the three problems is desired +that allows for mutually improved and consistent representation learning. +Second, balancing the combination so that it gives equal importance to all +individual results during fusion. Our proposed JPPF is parameter-free and +dynamically balances its input. The method is evaluated and compared on the +Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in +terms of PartPQ and Part-Whole Quality (PWQ). 
In extensive experiments, we +verify the importance of our fair fusion, highlight its most significant impact +for areas that can be further segmented into parts, and demonstrate the +generalization capabilities of our design without fine-tuning on 5 additional +datasets.",cs.CV,['cs.CV'] +CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection,Haonan Zhang · Longjun Liu · Yuqi Huang · YangZhao · Xinyu Lei · Bihan Wen, ,,https://github.com/zhnxjtu/CaKDP,,,,,nan +Bilateral Propagation Network for Depth Completion,Jie Tang · Fei-Peng Tian · Boshi An · Jian Li · Ping Tan, ,https://arxiv.org/abs/2403.11270,,2403.11270.pdf,Bilateral Propagation Network for Depth Completion,"Depth completion aims to derive a dense depth map from sparse depth +measurements with a synchronized color image. Current state-of-the-art (SOTA) +methods are predominantly propagation-based, which work as an iterative +refinement on the initial estimated dense depth. However, the initial depth +estimations mostly result from direct applications of convolutional layers on +the sparse depth map. In this paper, we present a Bilateral Propagation Network +(BP-Net), that propagates depth at the earliest stage to avoid directly +convolving on sparse data. Specifically, our approach propagates the target +depth from nearby depth measurements via a non-linear model, whose coefficients +are generated through a multi-layer perceptron conditioned on both +\emph{radiometric difference} and \emph{spatial distance}. By integrating +bilateral propagation with multi-modal fusion and depth refinement in a +multi-scale framework, our BP-Net demonstrates outstanding performance on both +indoor and outdoor scenes. It achieves SOTA on the NYUv2 dataset and ranks 1st +on the KITTI depth completion benchmark at the time of submission. Experimental +results not only show the effectiveness of bilateral propagation but also +emphasize the significance of early-stage propagation in contrast to the +refinement stage. Our code and trained models will be available on the project +page.",cs.CV,['cs.CV'] +SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,Peng Qi · Zehong Yan · Wynne Hsu · Mong Li Lee,https://pengqi.site/Sniffer/,https://arxiv.org/abs/2403.03170,,2403.03170.pdf,SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,"Misinformation is a prevalent societal issue due to its potential high risks. +Out-of-context (OOC) misinformation, where authentic images are repurposed with +false text, is one of the easiest and most effective ways to mislead audiences. +Current methods focus on assessing image-text consistency but lack convincing +explanations for their judgments, which is essential for debunking +misinformation. While Multimodal Large Language Models (MLLMs) have rich +knowledge and innate capability for visual reasoning and explanation +generation, they still lack sophistication in understanding and discovering the +subtle crossmodal differences. In this paper, we introduce SNIFFER, a novel +multimodal large language model specifically engineered for OOC misinformation +detection and explanation. SNIFFER employs two-stage instruction tuning on +InstructBLIP. The first stage refines the model's concept alignment of generic +objects with news-domain entities and the second stage leverages language-only +GPT-4 generated OOC-specific instruction data to fine-tune the model's +discriminatory powers. 
Enhanced by external tools and retrieval, SNIFFER not +only detects inconsistencies between text and image but also utilizes external +knowledge for contextual verification. Our experiments show that SNIFFER +surpasses the original MLLM by over 40% and outperforms state-of-the-art +methods in detection accuracy. SNIFFER also provides accurate and persuasive +explanations as validated by quantitative and human evaluations.",cs.MM,"['cs.MM', 'cs.AI', 'cs.CL', 'cs.CV', 'cs.CY']" +Semantic-aware SAM for Point-Prompted Instance Segmentation,Zhaoyang Wei · Pengfei Chen · Xuehui Yu · Guorong Li · Jianbin Jiao · Zhenjun Han, ,https://arxiv.org/abs/2312.15895,,2312.15895.pdf,Semantic-aware SAM for Point-Prompted Instance Segmentation,"Single-point annotation in visual tasks, with the goal of minimizing +labelling costs, is becoming increasingly prominent in research. Recently, +visual foundation models, such as Segment Anything (SAM), have gained +widespread usage due to their robust zero-shot capabilities and exceptional +annotation performance. However, SAM's class-agnostic output and high +confidence in local segmentation introduce 'semantic ambiguity', posing a +challenge for precise category-specific segmentation. In this paper, we +introduce a cost-effective category-specific segmenter using SAM. To tackle +this challenge, we have devised a Semantic-Aware Instance Segmentation Network +(SAPNet) that integrates Multiple Instance Learning (MIL) with matching +capability and SAM with point prompts. SAPNet strategically selects the most +representative mask proposals generated by SAM to supervise segmentation, with +a specific focus on object category information. Moreover, we introduce the +Point Distance Guidance and Box Mining Strategy to mitigate inherent +challenges: 'group' and 'local' issues in weakly supervised segmentation. These +strategies serve to further enhance the overall segmentation performance. The +experimental results on Pascal VOC and COCO demonstrate the promising +performance of our proposed SAPNet, emphasizing its semantic matching +capabilities and its potential to advance point-prompted instance segmentation. +The code will be made publicly available.",cs.CV,['cs.CV'] +Loopy-SLAM: Dense Neural SLAM with Loop Closures,Lorenzo Liso · Erik Sandström · Vladimir Yugay · Luc Van Gool · Martin R. Oswald, ,https://arxiv.org/abs/2402.09944,,2402.09944.pdf,Loopy-SLAM: Dense Neural SLAM with Loop Closures,"Neural RGBD SLAM techniques have shown promise in dense Simultaneous +Localization And Mapping (SLAM), yet face challenges such as error accumulation +during camera tracking resulting in distorted maps. In response, we introduce +Loopy-SLAM that globally optimizes poses and the dense 3D model. We use +frame-to-model tracking using a data-driven point-based submap generation +method and trigger loop closures online by performing global place recognition. +Robust pose graph optimization is used to rigidly align the local submaps. As +our representation is point based, map corrections can be performed efficiently +without the need to store the entire history of input frames used for mapping +as typically required by methods employing a grid based mapping structure. +Evaluation on the synthetic Replica and real-world TUM-RGBD and ScanNet +datasets demonstrate competitive or superior performance in tracking, mapping, +and rendering accuracy when compared to existing dense neural RGBD SLAM +methods. 
Project page: notchla.github.io/Loopy-SLAM.",cs.CV,['cs.CV'] +Aligning Logits Generatively for Principled Black-Box Knowledge Distillation,Jing Ma · Xiang Xiang · Ke Wang · Yuchuan Wu · Yongbin Li, ,https://arxiv.org/abs/2403.01427,,,Logit Standardization in Knowledge Distillation,"Knowledge distillation involves transferring soft labels from a teacher to a +student using a shared temperature-based softmax function. However, the +assumption of a shared temperature between teacher and student implies a +mandatory exact match between their logits in terms of logit range and +variance. This side-effect limits the performance of student, considering the +capacity discrepancy between them and the finding that the innate logit +relations of teacher are sufficient for student to learn. To address this +issue, we propose setting the temperature as the weighted standard deviation of +logit and performing a plug-and-play Z-score pre-process of logit +standardization before applying softmax and Kullback-Leibler divergence. Our +pre-process enables student to focus on essential logit relations from teacher +rather than requiring a magnitude match, and can improve the performance of +existing logit-based distillation methods. We also show a typical case where +the conventional setting of sharing temperature between teacher and student +cannot reliably yield the authentic distillation evaluation; nonetheless, this +challenge is successfully alleviated by our Z-score. We extensively evaluate +our method for various student and teacher models on CIFAR-100 and ImageNet, +showing its significant superiority. The vanilla knowledge distillation powered +by our pre-process can achieve favorable performance against state-of-the-art +methods, and other distillation variants can obtain considerable gain with the +assistance of our pre-process.",cs.CV,['cs.CV'] +Grid Diffusion Models for Text-to-Video Generation,Taegyeong Lee · Soyeong Kwon · Taehwan Kim,https://taegyeong-lee.github.io/text2video,https://arxiv.org/abs/2404.00234v1,,2404.00234v1.pdf,Grid Diffusion Models for Text-to-Video Generation,"Recent advances in the diffusion models have significantly improved +text-to-image generation. However, generating videos from text is a more +challenging task than generating images from text, due to the much larger +dataset and higher computational cost required. Most existing video generation +methods use either a 3D U-Net architecture that considers the temporal +dimension or autoregressive generation. These methods require large datasets +and are limited in terms of computational costs compared to text-to-image +generation. To tackle these challenges, we propose a simple but effective novel +grid diffusion for text-to-video generation without temporal dimension in +architecture and a large text-video paired dataset. We can generate a +high-quality video using a fixed amount of GPU memory regardless of the number +of frames by representing the video as a grid image. Additionally, since our +method reduces the dimensions of the video to the dimensions of the image, +various image-based methods can be applied to videos, such as text-guided video +manipulation from image manipulation. 
Our proposed method outperforms the +existing methods in both quantitative and qualitative evaluations, +demonstrating the suitability of our model for real-world video generation.",cs.CV,['cs.CV'] +Wonder3D: Single Image to 3D using Cross-Domain Diffusion,Xiaoxiao Long · Yuan-Chen Guo · Cheng Lin · Yuan Liu · Zhiyang Dou · Lingjie Liu · Yuexin Ma · Song-Hai Zhang · Marc Habermann · Christian Theobalt · Wenping Wang, ,https://arxiv.org/abs/2310.15008,,2310.15008.pdf,Wonder3D: Single Image to 3D using Cross-Domain Diffusion,"In this work, we introduce Wonder3D, a novel method for efficiently +generating high-fidelity textured meshes from single-view images.Recent methods +based on Score Distillation Sampling (SDS) have shown the potential to recover +3D geometry from 2D diffusion priors, but they typically suffer from +time-consuming per-shape optimization and inconsistent geometry. In contrast, +certain works directly produce 3D information via fast network inferences, but +their results are often of low quality and lack geometric details. To +holistically improve the quality, consistency, and efficiency of image-to-3D +tasks, we propose a cross-domain diffusion model that generates multi-view +normal maps and the corresponding color images. To ensure consistency, we +employ a multi-view cross-domain attention mechanism that facilitates +information exchange across views and modalities. Lastly, we introduce a +geometry-aware normal fusion algorithm that extracts high-quality surfaces from +the multi-view 2D representations. Our extensive evaluations demonstrate that +our method achieves high-quality reconstruction results, robust generalization, +and reasonably good efficiency compared to prior works.",cs.CV,['cs.CV'] +Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization,Ye Chen · Bingbing Ni · Jinfan Liu · Xiaoyang Huang · Xuanhong Chen, ,https://arxiv.org/abs/2308.13628,,2308.13628.pdf,HiFiHR: Enhancing 3D Hand Reconstruction from a Single Image via High-Fidelity Texture,"We present HiFiHR, a high-fidelity hand reconstruction approach that utilizes +render-and-compare in the learning-based framework from a single image, capable +of generating visually plausible and accurate 3D hand meshes while recovering +realistic textures. Our method achieves superior texture reconstruction by +employing a parametric hand model with predefined texture assets, and by +establishing a texture reconstruction consistency between the rendered and +input images during training. Moreover, based on pretraining the network on an +annotated dataset, we apply varying degrees of supervision using our pipeline, +i.e., self-supervision, weak supervision, and full supervision, and discuss the +various levels of contributions of the learned high-fidelity textures in +enhancing hand pose and shape estimation. Experimental results on public +benchmarks including FreiHAND and HO-3D demonstrate that our method outperforms +the state-of-the-art hand reconstruction methods in texture reconstruction +quality while maintaining comparable accuracy in pose and shape estimation. Our +code is available at https://github.com/viridityzhu/HiFiHR.",cs.CV,"['cs.CV', 'cs.AI']" +Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks,Yuhao Liu · Zhanghan Ke · Fang Liu · Nanxuan Zhao · Rynson W.H.
Lau, ,https://arxiv.org/abs/2403.00644,,2403.00644.pdf,Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks,"Diffusion models trained on large-scale datasets have achieved remarkable +progress in image synthesis. However, due to the randomness in the diffusion +process, they often struggle with handling diverse low-level tasks that require +details preservation. To overcome this limitation, we present a new Diff-Plugin +framework to enable a single pre-trained diffusion model to generate +high-fidelity results across a variety of low-level tasks. Specifically, we +first propose a lightweight Task-Plugin module with a dual branch design to +provide task-specific priors, guiding the diffusion process in preserving image +content. We then propose a Plugin-Selector that can automatically select +different Task-Plugins based on the text instruction, allowing users to edit +images by indicating multiple low-level tasks with natural language. We conduct +extensive experiments on 8 low-level vision tasks. The results demonstrate the +superiority of Diff-Plugin over existing methods, particularly in real-world +scenarios. Our ablations further validate that Diff-Plugin is stable, +schedulable, and supports robust training across different dataset sizes.",cs.CV,['cs.CV'] +Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation,Feilong Tang · Zhongxing Xu · Zhaojun QU · Wei Feng · xingjian jiang · Zongyuan Ge, ,https://arxiv.org/abs/2403.07630,,2403.07630.pdf,Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation,"Recent weakly supervised semantic segmentation (WSSS) methods strive to +incorporate contextual knowledge to improve the completeness of class +activation maps (CAM). In this work, we argue that the knowledge bias between +instances and contexts affects the capability of the prototype to sufficiently +understand instance semantics. Inspired by prototype learning theory, we +propose leveraging prototype awareness to capture diverse and fine-grained +feature attributes of instances. The hypothesis is that contextual prototypes +might erroneously activate similar and frequently co-occurring object +categories due to this knowledge bias. Therefore, we propose to enhance the +prototype representation ability by mitigating the bias to better capture +spatial coverage in semantic object regions. With this goal, we present a +Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic +context to enrich instance comprehension. The core of this method is to +accurately capture intra-class variations in object features through +context-aware prototypes, facilitating the adaptation to the semantic +attributes of various instances. We design feature distribution alignment to +optimize prototype awareness, aligning instance feature distributions with +dense features. In addition, a unified training framework is proposed to +combine label-guided classification supervision and prototypes-guided +self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show +that CPAL significantly improves off-the-shelf methods and achieves +state-of-the-art performance. 
The project is available at +https://github.com/Barrett-python/CPAL.",cs.CV,"['cs.CV', 'cs.AI']" +Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation,Luca Barsellotti · Roberto Amoroso · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara,https://aimagelab.github.io/freeda/,https://arxiv.org/abs/2404.06542,,2404.06542.pdf,Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation,"Open-vocabulary semantic segmentation aims at segmenting arbitrary categories +expressed in textual form. Previous works have trained over large amounts of +image-caption pairs to enforce pixel-level multimodal alignments. However, +captions provide global information about the semantics of a given image but +lack direct localization of individual concepts. Further, training on +large-scale datasets inevitably brings significant computational costs. In this +paper, we propose FreeDA, a training-free diffusion-augmented method for +open-vocabulary semantic segmentation, which leverages the ability of diffusion +models to visually localize generated concepts and local-global similarities to +match class-agnostic regions with semantic classes. Our approach involves an +offline stage in which textual-visual reference embeddings are collected, +starting from a large set of captions and leveraging visual and semantic +contexts. At test time, these are queried to support the visual matching +process, which is carried out by jointly considering class-agnostic regions and +global semantic similarities. Extensive analyses demonstrate that FreeDA +achieves state-of-the-art performance on five datasets, surpassing previous +methods by more than 7.0 average points in terms of mIoU and without requiring +any training.",cs.CV,['cs.CV'] +TULIP: Multi-camera 3D Precision Assessment of Parkinson's Disease,Kyungdo Kim · Sihan Lyu · Sneha Mantri · Timothy DUNN, ,,https://www.nature.com/articles/s41746-023-00905-9,,,,,nan +ControlRoom3D: Room Generation using Semantic Controls,Jonas Schult · Sam Tsai · Lukas Höllein · Bichen Wu · Jialiang Wang · Chih-Yao Ma · Kunpeng Li · Xiaofang Wang · Felix Wimbauer · Zijian He · Peizhao Zhang · Bastian Leibe · Peter Vajda · Ji Hou,https://jonasschult.github.io/ControlRoom3D/,https://arxiv.org/abs/2312.05208,,2312.05208.pdf,ControlRoom3D: Room Generation using Semantic Proxy Rooms,"Manually creating 3D environments for AR/VR applications is a complex process +requiring expert knowledge in 3D modeling software. Pioneering works facilitate +this process by generating room meshes conditioned on textual style +descriptions. Yet, many of these automatically generated 3D meshes do not +adhere to typical room layouts, compromising their plausibility, e.g., by +placing several beds in one bedroom. To address these challenges, we present +ControlRoom3D, a novel method to generate high-quality room meshes. Central to +our approach is a user-defined 3D semantic proxy room that outlines a rough +room layout based on semantic bounding boxes and a textual description of the +overall room style. Our key insight is that when rendered to 2D, this 3D +representation provides valuable geometric and semantic information to control +powerful 2D models to generate 3D consistent textures and geometry that aligns +well with the proxy room. 
Backed up by an extensive study including +quantitative metrics and qualitative user evaluations, our method generates +diverse and globally plausible 3D room meshes, thus empowering users to design +3D rooms effortlessly without specialized knowledge.",cs.CV,['cs.CV'] +Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging,Bhargav Ghanekar · Salman Siddique Khan · Pranav Sharma · Shreyas Singh · Vivek Boominathan · Kaushik Mitra · Ashok Veeraraghavan,https://shadowfax11.github.io/cads/,https://arxiv.org/abs/2402.18102,,2402.18102.pdf,Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging,"Passive, compact, single-shot 3D sensing is useful in many application areas +such as microscopy, medical imaging, surgical navigation, and autonomous +driving where form factor, time, and power constraints can exist. Obtaining +RGB-D scene information over a short imaging distance, in an ultra-compact form +factor, and in a passive, snapshot manner is challenging. Dual-pixel (DP) +sensors are a potential solution to achieve the same. DP sensors collect light +rays from two different halves of the lens in two interleaved pixel arrays, +thus capturing two slightly different views of the scene, like a stereo camera +system. However, imaging with a DP sensor implies that the defocus blur size is +directly proportional to the disparity seen between the views. This creates a +trade-off between disparity estimation vs. deblurring accuracy. To improve this +trade-off effect, we propose CADS (Coded Aperture Dual-Pixel Sensing), in which +we use a coded aperture in the imaging lens along with a DP sensor. In our +approach, we jointly learn an optimal coded pattern and the reconstruction +algorithm in an end-to-end optimization setting. Our resulting CADS imaging +system demonstrates improvement of >1.5dB PSNR in all-in-focus (AIF) estimates +and 5-6% in depth estimation quality over naive DP sensing for a wide range of +aperture settings. Furthermore, we build the proposed CADS prototypes for DSLR +photography settings and in an endoscope and a dermoscope form factor. Our +novel coded dual-pixel sensing approach demonstrates accurate RGB-D +reconstruction results in simulations and real-world experiments in a passive, +snapshot, and compact manner.",eess.IV,"['eess.IV', 'cs.CV']" +Real-time 3D-aware Portrait Video Relighting,Ziqi Cai · Kaiwen Jiang · Shu-Yu Chen · Yu-Kun Lai · Hongbo Fu · Boxin Shi · Lin Gao,http://geometrylearning.com/VideoRelighting/,https://arxiv.org/html/2402.14000v1,,2402.14000v1.pdf,Real-time 3D-aware Portrait Editing from a Single Image,"This work presents 3DPE, a practical tool that can efficiently edit a face +image following given prompts, like reference images or text descriptions, in +the 3D-aware manner. To this end, a lightweight module is distilled from a 3D +portrait generator and a text-to-image model, which provide prior knowledge of +face geometry and open-vocabulary editing capability, respectively. Such a +design brings two compelling advantages over existing approaches. First, our +system achieves real-time editing with a feedforward network (i.e., ~0.04s per +image), over 100x faster than the second competitor. Second, thanks to the +powerful priors, our module could focus on the learning of editing-related +variations, such that it manages to handle various types of editing +simultaneously in the training phase and further supports fast adaptation to +user-specified novel types of editing during inference (e.g., with ~5min +fine-tuning per case). 
The code, the model, and the interface will be made +publicly available to facilitate future research.",cs.CV,['cs.CV'] +DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes,Xiaoyu Zhou · Zhiwei Lin · Xiaojun Shan · Yongtao Wang · Deqing Sun · Ming-Hsuan Yang, ,https://arxiv.org/abs/2312.07920,,2312.07920.pdf,DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes,"We present DrivingGaussian, an efficient and effective framework for +surrounding dynamic autonomous driving scenes. For complex scenes with moving +objects, we first sequentially and progressively model the static background of +the entire scene with incremental static 3D Gaussians. We then leverage a +composite dynamic Gaussian graph to handle multiple moving objects, +individually reconstructing each object and restoring their accurate positions +and occlusion relationships within the scene. We further use a LiDAR prior for +Gaussian Splatting to reconstruct scenes with greater details and maintain +panoramic consistency. DrivingGaussian outperforms existing methods in dynamic +driving scene reconstruction and enables photorealistic surround-view synthesis +with high-fidelity and multi-camera consistency. Our project page is at: +https://github.com/VDIGPKU/DrivingGaussian.",cs.CV,['cs.CV'] +A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition,Yusheng Dai · HangChen · Jun Du · Ruoyu Wang · shihao chen · Haotian Wang · Chin-Hui Lee,https://github.com/dalision/ModalBiasAVSR,https://arxiv.org/abs/2403.04245,,2403.04245.pdf,A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition,"Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to +be sensitive to missing video frames, performing even worse than +single-modality models. While applying the dropout technique to the video +modality enhances robustness to missing frames, it simultaneously results in a +performance loss when dealing with complete data input. In this paper, we +investigate this contrasting phenomenon from the perspective of modality bias +and reveal that an excessive modality bias on the audio caused by dropout is +the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) +to systematically describe the relationship between modality bias and +robustness against missing modality in multimodal systems. Building on these +findings, we propose a novel Multimodal Distribution Approximation with +Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio +modality and to maintain performance and robustness simultaneously. Finally, to +address an entirely missing modality, we adopt adapters to dynamically switch +decision strategies. The effectiveness of our proposed approach is evaluated +and validated through a series of comprehensive experiments using the MISP2021 +and MISP2022 datasets. 
Our code is available at +https://github.com/dalision/ModalBiasAVSR",cs.SD,"['cs.SD', 'cs.CV', 'cs.LG', 'cs.MM', 'eess.AS']" +GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors,Yuan Dong · Qi Zuo · Xiaodong Gu · Weihao Yuan · zhengyi zhao · Zilong Dong · Liefeng Bo · Qixing Huang, ,https://arxiv.org/abs/2401.17603,,2401.17603.pdf,Topology-Aware Latent Diffusion for 3D Shape Generation,"We introduce a new generative model that combines latent diffusion with +persistent homology to create 3D shapes with high diversity, with a special +emphasis on their topological characteristics. Our method involves representing +3D shapes as implicit fields, then employing persistent homology to extract +topological features, including Betti numbers and persistence diagrams. The +shape generation process consists of two steps. Initially, we employ a +transformer-based autoencoding module to embed the implicit representation of +each 3D shape into a set of latent vectors. Subsequently, we navigate through +the learned latent space via a diffusion model. By strategically incorporating +topological features into the diffusion process, our generative module is able +to produce a richer variety of 3D shapes with different topological structures. +Furthermore, our framework is flexible, supporting generation tasks constrained +by a variety of inputs, including sparse and partial point clouds, as well as +sketches. By modifying the persistence diagrams, we can alter the topology of +the shapes generated from these input modalities.",cs.CV,"['cs.CV', 'I.3.5; I.2.10']" +Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification,Bin Yang · Jun Chen · Mang Ye, ,,https://dl.acm.org/doi/10.1145/3581783.3612077,,,,,nan +Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM,"Tongyan Hua · Addison, Lin Wang",https://vlis2022.github.io/nerf-slam-benchmark/,https://arxiv.org/abs/2403.19473,,2403.19473.pdf,Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM,"Implicit neural representation (INR), in combination with geometric +rendering, has recently been employed in real-time dense RGB-D SLAM. Despite +active research endeavors being made, there lacks a unified protocol for fair +evaluation, impeding the evolution of this area. In this work, we establish, to +our knowledge, the first open-source benchmark framework to evaluate the +performance of a wide spectrum of commonly used INRs and rendering functions +for mapping and localization. The goal of our benchmark is to 1) gain an +intuition of how different INRs and rendering functions impact mapping and +localization and 2) establish a unified evaluation protocol w.r.t. the design +choices that may impact the mapping and localization. With the framework, we +conduct a large suite of experiments, offering various insights in choosing the +INRs and geometric rendering functions: for example, the dense feature grid +outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and +color features are jointly encoded for memory efficiency. To extend the +findings into the practical scenario, a hybrid encoding strategy is proposed to +bring the best of the accuracy and completion from the grid-based and +decomposition-based INRs. 
We further propose explicit hybrid encoding for +high-fidelity dense grid mapping to comply with the RGB-D SLAM system that puts +the premise on robustness and computation efficiency.",cs.CV,['cs.CV'] +Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion,"Hao Ai · Addison, Lin Wang", ,http://export.arxiv.org/abs/2403.16376,,2403.16376.pdf,Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion,"360 depth estimation has recently received great attention for 3D +reconstruction owing to its omnidirectional field of view (FoV). Recent +approaches are predominantly focused on cross-projection fusion with +geometry-based re-projection: they fuse 360 images with equirectangular +projection (ERP) and another projection type, e.g., cubemap projection to +estimate depth with the ERP format. However, these methods suffer from 1) +limited local receptive fields, making it hardly possible to capture large FoV +scenes, and 2) prohibitive computational cost, caused by the complex +cross-projection fusion module design. In this paper, we propose Elite360D, a +novel framework that inputs the ERP image and icosahedron projection (ICOSAP) +point set, which is undistorted and spatially continuous. Elite360D is superior +in its capacity in learning a representation from a local-with-global +perspective. With a flexible ERP image encoder, it includes an ICOSAP point +encoder, and a Bi-projection Bi-attention Fusion (B2F) module (totally ~1M +parameters). Specifically, the ERP image encoder can take various perspective +image-trained backbones (e.g., ResNet, Transformer) to extract local features. +The point encoder extracts the global features from the ICOSAP. Then, the B2F +module captures the semantic- and distance-aware dependencies between each +pixel of the ERP feature and the entire ICOSAP feature set. Without specific +backbone design and obvious computational cost increase, Elite360D outperforms +the prior arts on several benchmark datasets.",cs.CV,['cs.CV'] +EventDance: Unsupervised Cross-modal Source-free Adaptation for Event-based Object Recognition,"Xu Zheng · Addison, Lin Wang", ,https://arxiv.org/abs/2403.14082,,2403.14082.pdf,EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition,"In this paper, we make the first attempt at achieving the cross-modal (i.e., +image-to-events) adaptation for event-based object recognition without +accessing any labeled source image data owning to privacy and commercial +issues. Tackling this novel problem is non-trivial due to the novelty of event +cameras and the distinct modality gap between images and events. In particular, +as only the source model is available, a hurdle is how to extract the knowledge +from the source model by only using the unlabeled target event data while +achieving knowledge transfer. To this end, we propose a novel framework, dubbed +EventDance for this unsupervised source-free cross-modal adaptation problem. +Importantly, inspired by event-to-video reconstruction methods, we propose a +reconstruction-based modality bridging (RMB) module, which reconstructs +intensity frames from events in a self-supervised manner. This makes it +possible to build up the surrogate images to extract the knowledge (i.e., +labels) from the source model. 
We then propose a multi-representation knowledge +adaptation (MKA) module that transfers the knowledge to target models learning +events with multiple representation types for fully exploring the +spatiotemporal information of events. The two modules connecting the source and +target models are mutually updated so as to achieve the best performance. +Experiments on three benchmark datasets with two adaption settings show that +EventDance is on par with prior methods utilizing the source data.",cs.CV,['cs.CV'] +GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation,"WEIMING ZHANG · Yexin Liu · Xu Zheng · Addison, Lin Wang", ,https://arxiv.org/abs/2403.16370,,2403.16370.pdf,GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation,"This paper tackles a novel yet challenging problem: how to transfer knowledge +from the emerging Segment Anything Model (SAM) -- which reveals impressive +zero-shot instance segmentation capacity -- to learn a compact panoramic +semantic segmentation model, i.e., student, without requiring any labeled data. +This poses considerable challenges due to SAM's inability to provide semantic +labels and the large capacity gap between SAM and the student. To this end, we +propose a novel framework, called GoodSAM, that introduces a teacher assistant +(TA) to provide semantic information, integrated with SAM to generate ensemble +logits to achieve knowledge transfer. Specifically, we propose a +Distortion-Aware Rectification (DAR) module that first addresses the distortion +problem of panoramic images by imposing prediction-level consistency and +boundary enhancement. This subtly enhances TA's prediction capacity on +panoramic images. DAR then incorporates a cross-task complementary fusion block +to adaptively merge the predictions of SAM and TA to obtain more reliable +ensemble logits. Moreover, we introduce a Multi-level Knowledge Adaptation +(MKA) module to efficiently transfer the multi-level feature knowledge from TA +and ensemble logits to learn a compact student model. Extensive experiments on +two benchmarks show that our GoodSAM achieves a remarkable +3.75\% mIoU +improvement over the state-of-the-art (SOTA) domain adaptation methods. Also, +our most lightweight model achieves comparable performance to the SOTA methods +with only 3.7M parameters.",cs.CV,['cs.CV'] +ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More,"Jiazhou Zhou · Xu Zheng · Yuanhuiyi Lyu · Addison, Lin Wang", ,https://arxiv.org/abs/2403.12534,,2403.12534.pdf,ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More,"Event cameras have recently been shown beneficial for practical vision tasks, +such as action recognition, thanks to their high temporal resolution, power +efficiency, and reduced privacy concerns. However, current research is hindered +by 1) the difficulty in processing events because of their prolonged duration +and dynamic actions with complex and ambiguous semantics and 2) the redundant +action depiction of the event frame representation with fixed stacks. We find +language naturally conveys abundant semantic information, rendering it +stunningly superior in reducing semantic uncertainty. 
In light of this, we +propose ExACT, a novel approach that, for the first time, tackles event-based +action recognition from a cross-modal conceptualizing perspective. Our ExACT +brings two technical contributions. Firstly, we propose an adaptive +fine-grained event (AFE) representation to adaptively filter out the repeated +events for the stationary objects while preserving dynamic ones. This subtly +enhances the performance of ExACT without extra computational cost. Then, we +propose a conceptual reasoning-based uncertainty estimation module, which +simulates the recognition process to enrich the semantic representation. In +particular, conceptual reasoning builds the temporal relation based on the +action semantics, and uncertainty estimation tackles the semantic uncertainty +of actions based on the distributional representation. Experiments show that +our ExACT achieves superior recognition accuracy of 94.83%(+2.23%), +90.10%(+37.47%) and 67.24% on PAF, HARDVS and our SeAct datasets respectively.",cs.CV,['cs.CV'] +"Semantics, Distortion, and Style Matter: Towards Source-free UDA for Panoramic Segmentation","Xu Zheng · Pengyuan Zhou · ATHANASIOS · Addison, Lin Wang",https://vlislab22.github.io/360SFUDA/,https://arxiv.org/abs/2403.12505,,2403.12505.pdf,"Semantics, Distortion, and Style Matter: Towards Source-free UDA for Panoramic Segmentation","This paper addresses an interesting yet challenging problem -- source-free +unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic +segmentation -- given only a pinhole image-trained model (i.e., source) and +unlabeled panoramic images (i.e., target). Tackling this problem is nontrivial +due to the semantic mismatches, style discrepancies, and inevitable distortion +of panoramic images. To this end, we propose a novel method that utilizes +Tangent Projection (TP) as it has less distortion and meanwhile slits the +equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. +Both projections are shown effective in extracting knowledge from the source +model. However, the distinct projection discrepancies between source and target +domains impede the direct knowledge transfer; thus, we propose a panoramic +prototype adaptation module (PPAM) to integrate panoramic prototypes from the +extracted knowledge for adaptation. We then impose the loss constraints on both +predictions and prototypes and propose a cross-dual attention module (CDAM) at +the feature level to better align the spatial and channel characteristics +across the domains and projections. Both knowledge extraction and transfer +processes are synchronously updated to reach the best performance. Extensive +experiments on the synthetic and real-world benchmarks, including outdoor and +indoor scenarios, demonstrate that our method achieves significantly better +performance than prior SFUDA methods for pinhole-to-panoramic adaptation.",cs.CV,['cs.CV'] +UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All,"Yuanhuiyi Lyu · Xu Zheng · Jiazhou Zhou · Addison, Lin Wang", ,https://arxiv.org/abs/2405.16108,,2405.16108.pdf,OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All,"Research on multi-modal learning dominantly aligns the modalities in a +unified space at training, and only a single one is taken for prediction at +inference. However, for a real machine, e.g., a robot, sensors could be added +or removed at any time. 
Thus, it is crucial to enable the machine to tackle the +mismatch and unequal-scale problems of modality combinations between training +and inference. In this paper, we tackle these problems from a new perspective: +""Modalities Help Modalities"". Intuitively, we present OmniBind, a novel +two-stage learning framework that can achieve any modality combinations and +interaction. It involves teaching data-constrained, a.k.a, student, modalities +to be aligned with the well-trained data-abundant, a.k.a, teacher, modalities. +This subtly enables the adaptive fusion of any modalities to build a unified +representation space for any combinations. Specifically, we propose Cross-modal +Alignment Distillation (CAD) to address the unequal-scale problem between +student and teacher modalities and effectively align student modalities into +the teacher modalities' representation space in stage one. We then propose an +Adaptive Fusion (AF) module to fuse any modality combinations and learn a +unified representation space in stage two. To address the mismatch problem, we +aggregate existing datasets and combine samples from different modalities by +the same semantics. This way, we build the first dataset for training and +evaluation that consists of teacher (image, text) and student (touch, thermal, +event, point cloud, audio) modalities and enables omni-bind for any of them. +Extensive experiments on the recognition task show performance gains over prior +arts by an average of 4.05 % on the arbitrary modality combination setting. It +also achieves state-of-the-art performance for a single modality, e.g., touch, +with a 4.34 % gain.",cs.CV,['cs.CV'] +C3Net: Compound Conditioned ControlNet for Multimodal Content Generation,Juntao Zhang · Yuehuai LIU · Yu-Wing Tai · Chi-Keung Tang, ,https://arxiv.org/abs/2311.17951,,2311.17951.pdf,C3Net: Compound Conditioned ControlNet for Multimodal Content Generation,"We present Compound Conditioned ControlNet, C3Net, a novel generative neural +architecture taking conditions from multiple modalities and synthesizing +multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the +ControlNet architecture to jointly train and make inferences on a +production-ready diffusion model and its trainable copies. Specifically, C3Net +first aligns the conditions from multi-modalities to the same semantic latent +space using modality-specific encoders based on contrastive training. Then, it +generates multimodal outputs based on the aligned latent space, whose semantic +information is combined using a ControlNet-like architecture called Control +C3-UNet. Correspondingly, with this system design, our model offers an improved +solution for joint-modality generation through learning and explaining +multimodal conditions instead of simply taking linear interpolations on the +latent space. Meanwhile, as we align conditions to a unified latent space, +C3Net only requires one trainable Control C3-UNet to work on multimodal +semantic information. Furthermore, our model employs unimodal pretraining on +the condition alignment stage, outperforming the non-pretrained alignment even +on relatively scarce training data and thus demonstrating high-quality compound +condition generation. We contribute the first high-quality tri-modal validation +set to validate quantitatively that C3Net outperforms or is on par with first +and contemporary state-of-the-art multimodal generation. 
Our codes and +tri-modal dataset will be released.",cs.LG,['cs.LG'] +Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach,"Guoqiang Liang · Kanghao Chen · Hangyu Li · Yunfan Lu · Addison, Lin Wang",https://vlislab22.github.io/eg-lowlight/.,https://arxiv.org/abs/2404.00834v1,,2404.00834v1.pdf,Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach,"Event camera has recently received much attention for low-light image +enhancement (LIE) thanks to their distinct advantages, such as high dynamic +range. However, current research is prohibitively restricted by the lack of +large-scale, real-world, and spatial-temporally aligned event-image datasets. +To this end, we propose a real-world (indoor and outdoor) dataset comprising +over 30K pairs of images and events under both low and normal illumination +conditions. To achieve this, we utilize a robotic arm that traces a consistent +non-linear trajectory to curate the dataset with spatial alignment precision +under 0.03mm. We then introduce a matching alignment strategy, rendering 90% of +our dataset with errors less than 0.01s. Based on the dataset, we propose a +novel event-guided LIE approach, called EvLight, towards robust performance in +real-world low-light scenes. Specifically, we first design the multi-scale +holistic fusion branch to extract holistic structural and textural information +from both events and images. To ensure robustness against variations in the +regional illumination and noise, we then introduce a Signal-to-Noise-Ratio +(SNR)-guided regional feature selection to selectively fuse features of images +from regions with high SNR and enhance those with low SNR by extracting +regional structure information from events. Extensive experiments on our +dataset and the synthetic SDSD dataset demonstrate our EvLight significantly +surpasses the frame-based methods. Code and datasets are available at +https://vlislab22.github.io/eg-lowlight/.",cs.CV,['cs.CV'] +Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning,Yixiong Zou · Yicong Liu · Yiman Hu · Yuhua Li · Ruixuan Li, ,https://arxiv.org/abs/2403.00567,,2403.00567.pdf,Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning,"Cross-domain few-shot learning (CDFSL) aims to acquire knowledge from limited +training data in the target domain by leveraging prior knowledge transferred +from source domains with abundant training samples. CDFSL faces challenges in +transferring knowledge across dissimilar domains and fine-tuning models with +limited training data. To address these challenges, we initially extend the +analysis of loss landscapes from the parameter space to the representation +space, which allows us to simultaneously interpret the transferring and +fine-tuning difficulties of CDFSL models. We observe that sharp minima in the +loss landscapes of the representation space result in representations that are +hard to transfer and fine-tune. Moreover, existing flatness-based methods have +limited generalization ability due to their short-range flatness. To enhance +the transferability and facilitate fine-tuning, we introduce a simple yet +effective approach to achieve long-range flattening of the minima in the loss +landscape. 
This approach considers representations that are differently +normalized as minima in the loss landscape and flattens the high-loss region in +the middle by randomly sampling interpolated representations. We implement this +method as a new normalization layer that replaces the original one in both CNNs +and ViTs. This layer is simple and lightweight, introducing only a minimal +number of additional parameters. Experimental results on 8 datasets demonstrate +that our approach outperforms state-of-the-art methods in terms of average +accuracy. Moreover, our method achieves performance improvements of up to 9\% +compared to the current best approaches on individual datasets. Our code will +be released.",cs.CV,"['cs.CV', 'cs.AI']" +PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization,Xu Peng · Junwei Zhu · Boyuan Jiang · Ying Tai · Donghao Luo · Jiangning Zhang · Wei Lin · Taisong Jin · Chengjie Wang · Rongrong Ji, ,https://arxiv.org/abs/2312.06354,,2312.06354.pdf,PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization,"Recent advancements in personalized image generation using diffusion models +have been noteworthy. However, existing methods suffer from inefficiencies due +to the requirement for subject-specific fine-tuning. This computationally +intensive process hinders efficient deployment, limiting practical usability. +Moreover, these methods often grapple with identity distortion and limited +expression diversity. In light of these challenges, we propose PortraitBooth, +an innovative approach designed for high efficiency, robust identity +preservation, and expression-editable text-to-image generation, without the +need for fine-tuning. PortraitBooth leverages subject embeddings from a face +recognition model for personalized image generation without fine-tuning. It +eliminates computational overhead and mitigates identity distortion. The +introduced dynamic identity preservation strategy further ensures close +resemblance to the original image identity. Moreover, PortraitBooth +incorporates emotion-aware cross-attention control for diverse facial +expressions in generated images, supporting text-driven expression editing. Its +scalability enables efficient and high-quality image creation, including +multi-subject generation. Extensive results demonstrate superior performance +over other state-of-the-art methods in both single and multiple image +generation scenarios.",cs.CV,['cs.CV'] +Discriminability-Driven Channel Selection for Out-of-Distribution Detection,Yue Yuan · Rundong He · Yicong Dong · Zhongyi Han · Yilong Yin, ,,https://www.semanticscholar.org/paper/Exploring-Channel-Aware-Typical-Features-for-He-Yuan/755390c365c4a39445f73ed09fe673f2b823876d,,,,,nan +Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising,Haijin Zeng · Jiezhang Cao · Yongyong Chen · Kai Zhang · Hiep Luong · Wilfried Philips, ,https://arxiv.org/abs/2311.11417,,,DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,"This paper endeavors to advance the precision of snapshot compressive imaging +(SCI) reconstruction for multispectral image (MSI). To achieve this, we +integrate the advantageous attributes of established SCI techniques and an +image generative model, propose a novel structured zero-shot diffusion model, +dubbed DiffSCI. 
DiffSCI leverages the structural insights from the deep prior +and optimization-based methodologies, complemented by the generative +capabilities offered by the contemporary denoising diffusion model. +Specifically, firstly, we employ a pre-trained diffusion model, which has been +trained on a substantial corpus of RGB images, as the generative denoiser +within the Plug-and-Play framework for the first time. This integration allows +for the successful completion of SCI reconstruction, especially in the case +that current methods struggle to address effectively. Secondly, we +systematically account for spectral band correlations and introduce a robust +methodology to mitigate wavelength mismatch, thus enabling seamless adaptation +of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is +implemented to expedite the resolution of the data subproblem. This +augmentation not only accelerates the convergence rate but also elevates the +quality of the reconstruction process. We present extensive testing to show +that DiffSCI exhibits discernible performance enhancements over prevailing +self-supervised and zero-shot approaches, surpassing even supervised +transformer counterparts across both simulated and real datasets. Our code will +be available.",cs.CV,['cs.CV'] +Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection,Huan Liu · Zichang Tan · Chuangchuang Tan · Yunchao Wei · Jingdong Wang · Yao Zhao,https://github.com/Michel-liu/FatFormer,https://arxiv.org/abs/2312.16649,,2312.16649.pdf,Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection,"In this paper, we study the problem of generalizable synthetic image +detection, aiming to detect forgery images from diverse generative methods, +e.g., GANs and diffusion models. Cutting-edge solutions start to explore the +benefits of pre-trained models, and mainly follow the fixed paradigm of solely +training an attached classifier, e.g., combining frozen CLIP-ViT with a +learnable linear layer in UniFD. However, our analysis shows that such a fixed +paradigm is prone to yield detectors with insufficient learning regarding +forgery representations. We attribute the key challenge to the lack of forgery +adaptation, and present a novel forgery-aware adaptive transformer approach, +namely FatFormer. Based on the pre-trained vision-language spaces of CLIP, +FatFormer introduces two core designs for the adaption to build generalized +forgery representations. First, motivated by the fact that both image and +frequency analysis are essential for synthetic image detection, we develop a +forgery-aware adapter to adapt image features to discern and integrate local +forgery traces within image and frequency domains. Second, we find that +considering the contrastive objectives between adapted image features and text +prompt embeddings, a previously overlooked aspect, results in a nontrivial +generalization improvement. Accordingly, we introduce language-guided alignment +to supervise the forgery adaptation with image and text prompts in FatFormer. +Experiments show that, by coupling these two designs, our approach tuned on +4-class ProGAN data attains a remarkable detection performance, achieving an +average of 98% accuracy to unseen GANs, and surprisingly generalizes to unseen +diffusion models with 95% accuracy.",cs.CV,['cs.CV'] +"What, when, and where?
-- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions",Brian Chen · Nina Shvetsova · Andrew Rouditchenko · Daniel Kondermann · Samuel Thomas · Shih-Fu Chang · Rogerio Feris · James Glass · Hilde Kuehne, ,,https://openreview.net/forum?id=eEtfBIjzWi,,,,,nan +CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization,Yao Ni · Piotr Koniusz, ,https://arxiv.org/abs/2404.00521,,2404.00521.pdf,CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization,"Generative Adversarial Networks (GANs) significantly advanced image +generation but their performance heavily depends on abundant training data. In +scenarios with limited data, GANs often struggle with discriminator overfitting +and unstable training. Batch Normalization (BN), despite being known for +enhancing generalization and training stability, has rarely been used in the +discriminator of Data-Efficient GANs. Our work addresses this gap by +identifying a critical flaw in BN: the tendency for gradient explosion during +the centering and scaling steps. To tackle this issue, we present CHAIN +(lipsCHitz continuity constrAIned Normalization), which replaces the +conventional centering step with zero-mean regularization and integrates a +Lipschitz continuity constraint in the scaling step. CHAIN further enhances GAN +training by adaptively interpolating the normalized and unnormalized features, +effectively avoiding discriminator overfitting. Our theoretical analyses firmly +establishes CHAIN's effectiveness in reducing gradients in latent features and +weights, improving stability and generalization in GAN training. Empirical +evidence supports our theory. CHAIN achieves state-of-the-art results in +data-limited scenarios on CIFAR-10/100, ImageNet, five low-shot and seven +high-resolution few-shot image datasets. Code: +https://github.com/MaxwellYaoNi/CHAIN",cs.LG,"['cs.LG', 'cs.CV']" +Improving Plasticity in Online Continual Learning via Collaborative Learning,Maorong Wang · Nicolas Michel · Ling Xiao · Toshihiko Yamasaki, ,https://arxiv.org/abs/2312.00600,,2312.00600.pdf,Improving Plasticity in Online Continual Learning via Collaborative Learning,"Online Continual Learning (CL) solves the problem of learning the +ever-emerging new classification tasks from a continuous data stream. Unlike +its offline counterpart, in online CL, the training data can only be seen once. +Most existing online CL research regards catastrophic forgetting (i.e., model +stability) as almost the only challenge. In this paper, we argue that the +model's capability to acquire new knowledge (i.e., model plasticity) is another +challenge in online CL. While replay-based strategies have been shown to be +effective in alleviating catastrophic forgetting, there is a notable gap in +research attention toward improving model plasticity. To this end, we propose +Collaborative Continual Learning (CCL), a collaborative learning based strategy +to improve the model's capability in acquiring new concepts. Additionally, we +introduce Distillation Chain (DC), a collaborative learning scheme to boost the +training of the models. We adapt CCL-DC to existing representative online CL +works. Extensive experiments demonstrate that even if the learners are +well-trained with state-of-the-art online CL methods, our strategy can still +improve model plasticity dramatically, and thereby improve the overall +performance by a large margin. 
The source code of our work is available at +https://github.com/maorong-wang/CCL-DC.",cs.LG,['cs.LG'] +Bi-SSC: Geometric-Semantic Bidirectional Fusion for Camera-based 3D Semantic Scene Completion,Yujie Xue · Ruihui Li · F anWu · Zhuo Tang · Kenli Li · Duan Mingxing, ,https://arxiv.org/abs/2312.05752,,2312.05752.pdf,Camera-based 3D Semantic Scene Completion with Sparse Guidance Network,"Semantic scene completion (SSC) aims to predict the semantic occupancy of +each voxel in the entire 3D scene from limited observations, which is an +emerging and critical task for autonomous driving. Recently, many studies have +turned to camera-based SSC solutions due to the richer visual cues and +cost-effectiveness of cameras. However, existing methods usually rely on +sophisticated and heavy 3D models to directly process the lifted 3D features +that are not discriminative enough for clear segmentation boundaries. In this +paper, we adopt the dense-sparse-dense design and propose an end-to-end +camera-based SSC framework, termed SGN, to diffuse semantics from the semantic- +and occupancy-aware seed voxels to the whole scene based on geometry prior and +occupancy information. By designing hybrid guidance (sparse semantic and +geometry guidance) and effective voxel aggregation for spatial occupancy and +geometry priors, we enhance the feature separation between different categories +and expedite the convergence of semantic diffusion. Extensive experimental +results on the SemanticKITTI dataset demonstrate the superiority of our SGN +over existing state-of-the-art methods.",cs.CV,['cs.CV'] +ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object,Chenshuang Zhang · Fei Pan · Junmo Kim · In So Kweon · Chengzhi Mao,https://github.com/chenshuang-zhang/imagenet_d,https://arxiv.org/abs/2403.18775,,2403.18775.pdf,ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object,"We establish rigorous benchmarks for visual perception robustness. Synthetic +images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific +type of evaluation over synthetic corruptions, backgrounds, and textures, yet +those robustness benchmarks are restricted in specified variations and have low +synthetic quality. In this work, we introduce generative model as a data source +for synthesizing hard images that benchmark deep models' robustness. Leveraging +diffusion models, we are able to generate images with more diversified +backgrounds, textures, and materials than any prior work, where we term this +benchmark as ImageNet-D. Experimental results show that ImageNet-D results in a +significant accuracy drop to a range of vision models, from the standard ResNet +visual classifier to the latest foundation models like CLIP and MiniGPT-4, +significantly reducing their accuracy by up to 60\%. Our work suggests that +diffusion models can be an effective source to test vision models. The code and +dataset are available at https://github.com/chenshuang-zhang/imagenet_d.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Object Pose Estimation via the Aggregation of Diffusion Features,Tianfu Wang · Guosheng Hu · Hongguang Wang,https://github.com/Tianfu18/diff-feats-pose,https://arxiv.org/abs/2403.18791,,2403.18791.pdf,Object Pose Estimation via the Aggregation of Diffusion Features,"Estimating the pose of objects from images is a crucial task of 3D scene +understanding, and recent approaches have shown promising results on very large +benchmarks. 
However, these methods experience a significant performance drop +when dealing with unseen objects. We believe that it results from the limited +generalizability of image features. To address this problem, we have an +in-depth analysis on the features of diffusion models, e.g. Stable Diffusion, +which hold substantial potential for modeling unseen objects. Based on this +analysis, we then innovatively introduce these diffusion features for object +pose estimation. To achieve this, we propose three distinct architectures that +can effectively capture and aggregate diffusion features of different +granularity, greatly improving the generalizability of object pose estimation. +Our approach outperforms the state-of-the-art methods by a considerable margin +on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our +method achieves higher accuracy than the previous best arts on unseen objects: +98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the +strong generalizability of our method. Our code is released at +https://github.com/Tianfu18/diff-feats-pose.",cs.CV,['cs.CV'] +Efficient Meshflow and Optical Flow Estimation from Event Cameras,Xinglong Luo · Ao Luo · Zhengning Wang · Chunyu Lin · Bing Zeng · Shuaicheng Liu,https://github.com/boomluo02/EEMFlow,https://arxiv.org/abs/2307.05033,,2307.05033.pdf,Towards Anytime Optical Flow Estimation with Event Cameras,"Optical flow estimation is a fundamental task in the field of autonomous +driving. Event cameras are capable of responding to log-brightness changes in +microseconds. Its characteristic of producing responses only to the changing +region is particularly suitable for optical flow estimation. In contrast to the +super low-latency response speed of event cameras, existing datasets collected +via event cameras, however, only provide limited frame rate optical flow ground +truth, (e.g., at 10Hz), greatly restricting the potential of event-driven +optical flow. To address this challenge, we put forward a high-frame-rate, +low-latency event representation Unified Voxel Grid, sequentially fed into the +network bin by bin. We then propose EVA-Flow, an EVent-based Anytime Flow +estimation network to produce high-frame-rate event optical flow with only +low-frame-rate optical flow ground truth for supervision. The key component of +our EVA-Flow is the stacked Spatiotemporal Motion Refinement (SMR) module, +which predicts temporally dense optical flow and enhances the accuracy via +spatial-temporal motion refinement. The time-dense feature warping utilized in +the SMR module provides implicit supervision for the intermediate optical flow. +Additionally, we introduce the Rectified Flow Warp Loss (RFWL) for the +unsupervised evaluation of intermediate optical flow in the absence of ground +truth. This is, to the best of our knowledge, the first work focusing on +anytime optical flow estimation via event cameras. A comprehensive variety of +experiments on MVSEC, DESC, and our EVA-FlowSet demonstrates that EVA-Flow +achieves competitive performance, super-low-latency (5ms), fastest inference +(9.2ms), time-dense motion estimation (200Hz), and strong generalization. 
Our +code will be available at https://github.com/Yaozhuwa/EVA-Flow.",cs.CV,"['cs.CV', 'cs.RO', 'eess.IV']" +MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior,Honghua Chen · Chen Change Loy · Xingang Pan, ,https://arxiv.org/abs/2405.02859,,2405.02859.pdf,MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior,"Despite the emergence of successful NeRF inpainting methods built upon +explicit RGB and depth 2D inpainting supervisions, these methods are inherently +constrained by the capabilities of their underlying 2D inpainters. This is due +to two key reasons: (i) independently inpainting constituent images results in +view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure +high-quality geometry completion and alignment with inpainted RGB images. + To overcome these limitations, we propose a novel approach called MVIP-NeRF +that harnesses the potential of diffusion priors for NeRF inpainting, +addressing both appearance and geometry aspects. MVIP-NeRF performs joint +inpainting across multiple views to reach a consistent solution, which is +achieved via an iterative optimization process based on Score Distillation +Sampling (SDS). Apart from recovering the rendered RGB images, we also extract +normal maps as a geometric representation and define a normal SDS loss that +motivates accurate geometry inpainting and alignment with the appearance. +Additionally, we formulate a multi-view SDS score function to distill +generative priors simultaneously from different view images, ensuring +consistent visual completion when dealing with large view variations. Our +experimental results show better appearance and geometry recovery than previous +NeRF inpainting methods.",cs.CV,['cs.CV'] +Functional Diffusion,Biao Zhang · Peter Wonka, ,https://arxiv.org/abs/2311.15435,,2311.15435.pdf,Functional Diffusion,"We propose a new class of generative diffusion models, called functional +diffusion. In contrast to previous work, functional diffusion works on samples +that are represented by functions with a continuous domain. Functional +diffusion can be seen as an extension of classical diffusion models to an +infinite-dimensional domain. Functional diffusion is very versatile as images, +videos, audio, 3D shapes, deformations, \etc, can be handled by the same +framework with minimal changes. In addition, functional diffusion is especially +suited for irregular data or data defined in non-standard domains. In our work, +we derive the necessary foundations for functional diffusion and propose a +first implementation based on the transformer architecture. We show generative +results on complicated signed distance functions and deformation functions +defined on 3D surfaces.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping,Hyeongjun Kwon · Jinhyun Jang · Jin Kim · Kwonyoung Kim · Kwanghoon Sohn, ,https://arxiv.org/abs/2404.00974v1,,2404.00974v1.pdf,Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping,"Visual scenes are naturally organized in a hierarchy, where a coarse semantic +is recursively comprised of several fine details. Exploring such a visual +hierarchy is crucial to recognize the complex relations of visual elements, +leading to a comprehensive scene understanding. In this paper, we propose a +Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the +structured understanding of the pre-trained Deep Neural Networks (DNNs). 
+Hi-Mapper investigates the hierarchical organization of the visual scene by 1) +pre-defining a hierarchy tree through the encapsulation of probability +densities; and 2) learning the hierarchical relations in hyperbolic space with +a novel hierarchical contrastive loss. The pre-defined hierarchy tree +recursively interacts with the visual features of the pre-trained DNNs through +hierarchy decomposition and encoding procedures, thereby effectively +identifying the visual hierarchy and enhancing the recognition of an entire +scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances +the representation capability of DNNs, leading to an improved performance on +various tasks, including image classification and dense prediction tasks.",cs.CV,['cs.CV'] +Neural Underwater Scene Representation,Yunkai Tang · Chengxuan Zhu · Renjie Wan · Chao Xu · Boxin Shi, ,,https://freebutuselesssoul.github.io/publications/cvpr2024a,,,,,nan +ViewFusion: Towards Multi-View Consistency via Interpolated Denoising,Xianghui Yang · Gil Avraham · Yan Zuo · Sameera Ramasinghe · Loris Bazzani · Anton van den Hengel, ,https://arxiv.org/abs/2402.18842,,2402.18842.pdf,ViewFusion: Towards Multi-View Consistency via Interpolated Denoising,"Novel-view synthesis through diffusion models has demonstrated remarkable +potential for generating diverse and high-quality images. Yet, the independent +process of image generation in these prevailing methods leads to challenges in +maintaining multiple-view consistency. To address this, we introduce +ViewFusion, a novel, training-free algorithm that can be seamlessly integrated +into existing pre-trained diffusion models. Our approach adopts an +auto-regressive method that implicitly leverages previously generated views as +context for the next view generation, ensuring robust multi-view consistency +during the novel-view generation process. Through a diffusion process that +fuses known-view information via interpolated denoising, our framework +successfully extends single-view conditioned models to work in multiple-view +conditional settings without any additional fine-tuning. Extensive experimental +results demonstrate the effectiveness of ViewFusion in generating consistent +and detailed novel views.",cs.CV,['cs.CV'] +Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation,Ruicong Liu · Takehiko Ohkawa · Mingfang Zhang · Yoichi Sato, ,https://arxiv.org/abs/2403.04381,,2403.04381.pdf,Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation,"The pursuit of accurate 3D hand pose estimation stands as a keystone for +understanding human activity in the realm of egocentric vision. The majority of +existing estimation methods still rely on single-view images as input, leading +to potential limitations, e.g., limited field-of-view and ambiguity in depth. +To address these problems, adding another camera to better capture the shape of +hands is a practical direction. However, existing multi-view hand pose +estimation methods suffer from two main drawbacks: 1) Requiring multi-view +annotations for training, which are expensive. 2) During testing, the model +becomes inapplicable if camera parameters/layout are not the same as those used +in training. In this paper, we propose a novel Single-to-Dual-view adaptation +(S2DHand) solution that adapts a pre-trained single-view estimator to dual +views. Compared with existing multi-view training methods, 1) our adaptation +process is unsupervised, eliminating the need for multi-view annotation. 
2) +Moreover, our method can handle arbitrary dual-view pairs with unknown camera +parameters, making the model applicable to diverse camera settings. +Specifically, S2DHand is built on certain stereo constraints, including +pair-wise cross-view consensus and invariance of transformation between both +views. These two stereo constraints are used in a complementary manner to +generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal +that S2DHand achieves significant improvements on arbitrary camera pairs under +both in-dataset and cross-dataset settings, and outperforms existing adaptation +methods with leading performance. Project page: +https://github.com/MickeyLLG/S2DHand.",cs.CV,['cs.CV'] +Backpropagation-free Network for 3D Test-time Adaptation,YANSHUO WANG · Ali Cheraghian · Zeeshan Hayder · JIE HONG · Sameera Ramasinghe · Shafin Rahman · David Ahmedt-Aristizabal · Xuesong Li · Lars Petersson · Mehrtash Harandi, ,https://arxiv.org/abs/2403.18442,,2403.18442.pdf,Backpropagation-free Network for 3D Test-time Adaptation,"Real-world systems often encounter new data over time, which leads to +experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods +tend to apply computationally heavy and memory-intensive backpropagation-based +approaches to handle this. Here, we propose a novel method that uses a +backpropagation-free approach for TTA for the specific case of 3D data. Our +model uses a two-stream architecture to maintain knowledge about the source +domain as well as complementary target-domain-specific information. The +backpropagation-free property of our model helps address the well-known +forgetting problem and mitigates the error accumulation issue. The proposed +method also eliminates the need for the usually noisy process of +pseudo-labeling and reliance on costly self-supervised training. Moreover, our +method leverages subspace learning, effectively reducing the distribution +variance between the two domains. Furthermore, the source-domain-specific and +the target-domain-specific streams are aligned using a novel entropy-based +adaptive fusion strategy. Extensive experiments on popular benchmarks +demonstrate the effectiveness of our method. The code will be available at +\url{https://github.com/abie-e/BFTT3D}.",cs.CV,['cs.CV'] +Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment,Muhammad Sohail Danish · Muhammad Haris Khan · Muhammad Akhtar Munir · M. Sarfraz · Mohsen Ali, ,https://arxiv.org/abs/2405.14497,,2405.14497.pdf,Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment,"In this work, we tackle the problem of domain generalization for object +detection, specifically focusing on the scenario where only a single source +domain is available. We propose an effective approach that involves two key +steps: diversifying the source domain and aligning detections based on class +prediction confidence and localization. Firstly, we demonstrate that by +carefully selecting a set of augmentations, a base detector can outperform +existing methods for single domain generalization by a good margin. This +highlights the importance of domain diversification in improving the +performance of object detectors. Secondly, we introduce a method to align +detections from multiple views, considering both classification and +localization outputs. 
This alignment procedure leads to better generalized and +well-calibrated object detector models, which are crucial for accurate +decision-making in safety-critical applications. Our approach is +detector-agnostic and can be seamlessly applied to both single-stage and +two-stage detectors. To validate the effectiveness of our proposed methods, we +conduct extensive experiments and ablations on challenging domain-shift +scenarios. The results consistently demonstrate the superiority of our approach +compared to existing methods. Our code and models are available at: +https://github.com/msohaildanish/DivAlign",cs.CV,['cs.CV'] +Universal Segmentation at Arbitrary Granularity with Language Instruction,Yong Liu · Cairong Zhang · Yitong Wang · Jiahao Wang · Yujiu Yang · Yansong Tang, ,https://arxiv.org/abs/2312.01623,,2312.01623.pdf,Universal Segmentation at Arbitrary Granularity with Language Instruction,"This paper aims to achieve universal segmentation of arbitrary semantic +level. Despite significant progress in recent years, specialist segmentation +approaches are limited to specific tasks and data distribution. Retraining a +new model for adaptation to new scenarios or settings takes expensive +computation and time cost, which raises the demand for versatile and universal +segmentation model that can cater to various granularity. Although some +attempts have been made for unifying different segmentation tasks or +generalization to various scenarios, limitations in the definition of paradigms +and input-output spaces make it difficult for them to achieve accurate +understanding of content at arbitrary granularity. To this end, we present +UniLSeg, a universal segmentation model that can perform segmentation at any +semantic level with the guidance of language instructions. For training +UniLSeg, we reorganize a group of tasks from original diverse distributions +into a unified data format, where images with texts describing segmentation +targets as input and corresponding masks are output. Combined with a automatic +annotation engine for utilizing numerous unlabeled data, UniLSeg achieves +excellent performance on various tasks and settings, surpassing both specialist +and unified segmentation models.",cs.CV,['cs.CV'] +ScanFormer: Referring Expression Comprehension by Iteratively Scanning,Wei Su · Peihan Miao · Huanzhang Dou · Xi Li, ,http://export.arxiv.org/abs/2306.04451,,2306.04451.pdf,Referring Expression Comprehension Using Language Adaptive Inference,"Different from universal object detection, referring expression comprehension +(REC) aims to locate specific objects referred to by natural language +expressions. The expression provides high-level concepts of relevant visual and +contextual patterns, which vary significantly with different expressions and +account for only a few of those encoded in the REC model. This leads us to a +question: do we really need the entire network with a fixed structure for +various referring expressions? Ideally, given an expression, only +expression-relevant components of the REC model are required. These components +should be small in number as each expression only contains very few visual and +contextual clues. This paper explores the adaptation between expressions and +REC models for dynamic inference. Concretely, we propose a neat yet efficient +framework named Language Adaptive Dynamic Subnets (LADS), which can extract +language-adaptive subnets from the REC model conditioned on the referring +expressions. 
By using the compact subnet, the inference can be more economical +and efficient. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and +Referit show that the proposed method achieves faster inference speed and +higher accuracy against state-of-the-art approaches.",cs.CV,['cs.CV'] +SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing,Tomoki Ichikawa · Shohei Nobuhara · Ko Nishino,https://vision.ist.i.kyoto-u.ac.jp/research/spiders/,https://arxiv.org/abs/2312.04553,,2312.04553.pdf,SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing,"Can we capture shape and reflectance in stealth? Such capability would be +valuable for many application domains in vision, xR, robotics, and HCI. We +introduce structured polarization for invisible depth and reflectance sensing +(SPIDeRS), the first depth and reflectance sensing method using patterns of +polarized light. The key idea is to modulate the angle of linear polarization +(AoLP) of projected light at each pixel. The use of polarization makes it +invisible and lets us recover not only depth but also directly surface normals +and even reflectance. We implement SPIDeRS with a liquid crystal spatial light +modulator (SLM) and a polarimetric camera. We derive a novel method for +robustly extracting the projected structured polarization pattern from the +polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by +applying it to a number of real-world objects. The results show that our method +successfully reconstructs object shapes of various materials and is robust to +diffuse reflection and ambient light. We also demonstrate relighting using +recovered surface normals and reflectance. We believe SPIDeRS opens a new +avenue of polarization use in visual sensing.",cs.CV,"['cs.CV', 'eess.IV']" +MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning,Chaoyi Zhang · Kevin Lin · Zhengyuan Yang · Jianfeng Wang · Linjie Li · Chung-Ching Lin · Zicheng Liu · Lijuan Wang, ,https://arxiv.org/abs/2311.17435,,2311.17435.pdf,MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning,"We present MM-Narrator, a novel system leveraging GPT-4 with multimodal +in-context learning for the generation of audio descriptions (AD). Unlike +previous methods that primarily focused on downstream fine-tuning with short +video clips, MM-Narrator excels in generating precise audio descriptions for +videos of extensive lengths, even beyond hours, in an autoregressive manner. +This capability is made possible by the proposed memory-augmented generation +process, which effectively utilizes both the short-term textual context and +long-term visual memory through an efficient register-and-recall mechanism. +These contextual memories compile pertinent past information, including +storylines and character identities, ensuring an accurate tracking and +depicting of story-coherent and character-centric audio descriptions. +Maintaining the training-free design of MM-Narrator, we further propose a +complexity-based demonstration selection strategy to largely enhance its +multi-step reasoning capability via few-shot multimodal in-context learning +(MM-ICL). Experimental results on MAD-eval dataset demonstrate that MM-Narrator +consistently outperforms both the existing fine-tuning-based approaches and +LLM-based approaches in most scenarios, as measured by standard evaluation +metrics. Additionally, we introduce the first segment-based evaluator for +recurrent text generation. 
Empowered by GPT-4, this evaluator comprehensively +reasons and marks AD generation performance in various extendable dimensions.",cs.CV,"['cs.CV', 'cs.AI']" +DisCo: Disentangled Control for Realistic Human Dance Generation,Tan Wang · Linjie Li · Kevin Lin · Yuanhao Zhai · Chung-Ching Lin · Zhengyuan Yang · Hanwang Zhang · Zicheng Liu · Lijuan Wang, ,https://arxiv.org/abs/2307.00040,,2307.00040.pdf,DisCo: Disentangled Control for Realistic Human Dance Generation,"Generative AI has made significant strides in computer vision, particularly +in text-driven image/video synthesis (T2I/T2V). Despite the notable +advancements, it remains challenging in human-centric content synthesis such as +realistic dance generation. Current methodologies, primarily tailored for human +motion transfer, encounter difficulties when confronted with real-world dance +scenarios (e.g., social media dance), which require to generalize across a wide +spectrum of poses and intricate human details. In this paper, we depart from +the traditional paradigm of human motion transfer and emphasize two additional +critical attributes for the synthesis of human dance content in social media +contexts: (i) Generalizability: the model should be able to generalize beyond +generic human viewpoints as well as unseen human subjects, backgrounds, and +poses; (ii) Compositionality: it should allow for the seamless composition of +seen/unseen subjects, backgrounds, and poses from different sources. To address +these challenges, we introduce DISCO, which includes a novel model architecture +with disentangled control to improve the compositionality of dance synthesis, +and an effective human attribute pre-training for better generalizability to +unseen humans. Extensive qualitative and quantitative results demonstrate that +DisCc can generate high-quality human dance images and videos with diverse +appearances and flexible motions. Code is available at +https://disco-dance.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning,Shihua Zhang · Zizhuo Li · Yuan Gao · Jiayi Ma, ,,https://ojs.aaai.org/index.php/AAAI/article/view/25456,,,,,nan +EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning,Hongxia Xie · Chu-Jun Peng · Yu-Wen Tseng · Hung-Jen Chen · Chan-Feng Hsu · Hong-Han Shuai · Wen-Huang Cheng, ,https://arxiv.org/abs/2404.16670,,2404.16670.pdf,EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning,"Visual Instruction Tuning represents a novel learning paradigm involving the +fine-tuning of pre-trained language models using task-specific instructions. +This paradigm shows promising zero-shot results in various natural language +processing tasks but is still unexplored in vision emotion understanding. In +this work, we focus on enhancing the model's proficiency in understanding and +adhering to instructions related to emotional contexts. Initially, we identify +key visual clues critical to visual emotion recognition. Subsequently, we +introduce a novel GPT-assisted pipeline for generating emotion visual +instruction data, effectively addressing the scarcity of annotated instruction +data in this domain. Expanding on the groundwork established by InstructBLIP, +our proposed EmoVIT architecture incorporates emotion-specific instruction +data, leveraging the powerful capabilities of Large Language Models to enhance +performance. 
Through extensive experiments, our model showcases its proficiency +in emotion classification, adeptness in affective reasoning, and competence in +comprehending humor. The comparative analysis provides a robust benchmark for +Emotion Visual Instruction Tuning in the era of LLMs, providing valuable +insights and opening avenues for future exploration in this domain. Our code is +available at \url{https://github.com/aimmemotion/EmoVIT}.",cs.CV,"['cs.CV', 'cs.AI']" +PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion,Ying-Tian Liu · Yuan-Chen Guo · Guan Luo · Heyi Sun · Wei Yin · Song-Hai Zhang, ,https://arxiv.org/abs/2312.09069,,2312.09069.pdf,PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion,"Diffusion models trained on large-scale text-image datasets have demonstrated +a strong capability of controllable high-quality image generation from +arbitrary text prompts. However, the generation quality and generalization +ability of 3D diffusion models is hindered by the scarcity of high-quality and +large-scale 3D datasets. In this paper, we present PI3D, a framework that fully +leverages the pre-trained text-to-image diffusion models' ability to generate +high-quality 3D shapes from text prompts in minutes. The core idea is to +connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB +Images. We fine-tune an existing text-to-image diffusion model to produce such +pseudo-images using a small number of text-3D pairs. Surprisingly, we find that +it can already generate meaningful and consistent 3D shapes given complex text +descriptions. We further take the generated shapes as the starting point for a +lightweight iterative refinement using score distillation sampling to achieve +high-quality generation under a low budget. PI3D generates a single 3D shape +from text in only 3 minutes and the quality is validated to outperform existing +3D generative models by a large margin.",cs.CV,['cs.CV'] +VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction,Jiaqi Lin · Zhihao Li · Xiao Tang · Jianzhuang Liu · Shiyong Liu · Jiayue Liu · Yangdi Lu · Xiaofei Wu · Songcen Xu · Youliang Yan · Wenming Yang, ,https://arxiv.org/abs/2402.17427,,2402.17427.pdf,VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction,"Existing NeRF-based methods for large scene reconstruction often have +limitations in visual quality and rendering speed. While the recent 3D Gaussian +Splatting works well on small-scale and object-centric scenes, scaling it up to +large scenes poses challenges due to limited video memory, long optimization +time, and noticeable appearance variations. To address these challenges, we +present VastGaussian, the first method for high-quality reconstruction and +real-time rendering on large scenes based on 3D Gaussian Splatting. We propose +a progressive partitioning strategy to divide a large scene into multiple +cells, where the training cameras and point cloud are properly distributed with +an airspace-aware visibility criterion. These cells are merged into a complete +scene after parallel optimization. We also introduce decoupled appearance +modeling into the optimization process to reduce appearance variations in the +rendered images. 
Our approach outperforms existing NeRF-based methods and +achieves state-of-the-art results on multiple large scene datasets, enabling +fast optimization and high-fidelity real-time rendering.",cs.CV,['cs.CV'] +Open-Vocabulary Segmentation with Semantic-Assisted Calibration,Yong Liu · Sule Bai · Guanbin Li · Yitong Wang · Yansong Tang, ,https://arxiv.org/abs/2312.04089,,,Open-Vocabulary Segmentation with Semantic-Assisted Calibration,"This paper studies open-vocabulary segmentation (OVS) through calibrating +in-vocabulary and domain-biased embedding space with generalized contextual +prior of CLIP. As the core of open-vocabulary understanding, alignment of +visual content with the semantics of unbounded text has become the bottleneck +of this field. To address this challenge, recent works propose to utilize CLIP +as an additional classifier and aggregate model predictions with CLIP +classification results. Despite their remarkable progress, performance of OVS +methods in relevant scenarios is still unsatisfactory compared with supervised +counterparts. We attribute this to the in-vocabulary embedding and +domain-biased CLIP prediction. To this end, we present a Semantic-assisted +CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior +of CLIP into proposal embedding to avoid collapsing on known categories. +Besides, a contextual shift strategy is applied to mitigate the lack of global +context and unnatural background noise. With above designs, SCAN achieves +state-of-the-art performance on all popular open-vocabulary segmentation +benchmarks. Furthermore, we also focus on the problem of existing evaluation +system that ignores semantic duplication across categories, and propose a new +metric called Semantic-Guided IoU (SG-IoU).",cs.CV,['cs.CV'] +GPT4Point: A Unified Framework for Point-Language Understanding and Generation,Zhangyang Qi · Ye Fang · Zeyi Sun · Xiaoyang Wu · Tong Wu · Jiaqi Wang · Dahua Lin · Hengshuang Zhao, ,https://arxiv.org/abs/2312.02980,,2312.02980.pdf,GPT4Point: A Unified Framework for Point-Language Understanding and Generation,"Multimodal Large Language Models (MLLMs) have excelled in 2D image-text +comprehension and image generation, but their understanding of the 3D world is +notably deficient, limiting progress in 3D language understanding and +generation. To solve this problem, we introduce GPT4Point, an innovative +groundbreaking point-language multimodal model designed specifically for +unified 3D object understanding and generation within the MLLM framework. +GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text +reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point +is equipped with advanced capabilities for controllable 3D generation, it can +get high-quality results through a low-quality point-text feature maintaining +the geometric shapes and colors. To support the expansive needs of 3D +object-text pairs, we develop Pyramid-XL, a point-language dataset annotation +engine. It constructs a large-scale database over 1M objects of varied text +granularity levels from the Objaverse-XL dataset, essential for training +GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D +point-language understanding capabilities. 
In extensive evaluations, GPT4Point +has demonstrated superior performance in understanding and generation.",cs.CV,['cs.CV'] +FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation,Pengchong Qiao · Lei Shang · Chang Liu · Baigui Sun · Xiangyang Ji · Jie Chen, ,,https://paperswithcode.com/paper/facechain-sude-building-derived-class-to,,,,,nan +TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding,Yun Liu · Haolin Yang · Xu Si · Ling Liu · Zipeng Li · Yuxiang Zhang · Yebin Liu · Li Yi, ,https://arxiv.org/abs/2401.08399,,2401.08399.pdf,TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding,"Humans commonly work with multiple objects in daily life and can intuitively +transfer manipulation skills to novel objects by understanding object +functional regularities. However, existing technical approaches for analyzing +and synthesizing hand-object manipulation are mostly limited to handling a +single hand and object due to the lack of data support. To address this, we +construct TACO, an extensive bimanual hand-object-interaction dataset spanning +a large variety of tool-action-object compositions for daily human activities. +TACO contains 2.5K motion sequences paired with third-person and egocentric +views, precise hand-object 3D meshes, and action labels. To rapidly expand the +data scale, we present a fully automatic data acquisition pipeline combining +multi-view sensing with an optical motion capture system. With the vast +research fields provided by TACO, we benchmark three generalizable +hand-object-interaction tasks: compositional action recognition, generalizable +hand-object motion forecasting, and cooperative grasp synthesis. Extensive +experiments reveal new insights, challenges, and opportunities for advancing +the studies of generalizable hand-object motion analysis and synthesis. Our +data and code are available at https://taco2024.github.io.",cs.CV,['cs.CV'] +Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,Zeyi Sun · Ye Fang · Tong Wu · Pan Zhang · Yuhang Zang · Shu Kong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang,https://aleafy.github.io/alpha-clip/,https://arxiv.org/abs/2312.03818,,2312.03818.pdf,Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,"Contrastive Language-Image Pre-training (CLIP) plays an essential role in +extracting valuable content information from images across diverse tasks. It +aligns textual and visual modalities to comprehend the entire image, including +all the details, even those irrelevant to specific tasks. However, for a finer +understanding and controlled editing of images, it becomes crucial to focus on +specific regions of interest, which can be indicated as points, masks, or boxes +by humans or perception models. To fulfill the requirements, we introduce +Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to +suggest attentive regions and fine-tuned with constructed millions of RGBA +region-text pairs. Alpha-CLIP not only preserves the visual recognition ability +of CLIP but also enables precise control over the emphasis of image contents. +It demonstrates effectiveness in various tasks, including but not limited to +open-world recognition, multimodal large language models, and conditional 2D / +3D generation. 
It has a strong potential to serve as a versatile tool for +image-related tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +VCoder: Versatile Vision Encoders for Multimodal Large Language Models,Jitesh Jain · Jianwei Yang · Humphrey Shi,https://praeclarumjj3.github.io/vcoder/,https://arxiv.org/abs/2312.14233,,2312.14233.pdf,VCoder: Versatile Vision Encoders for Multimodal Large Language Models,"Humans possess the remarkable skill of Visual Perception, the ability to see +and understand the seen, helping them make sense of the visual world and, in +turn, reason. Multimodal Large Language Models (MLLM) have recently achieved +impressive performance on vision-language tasks ranging from visual +question-answering and image captioning to visual reasoning and image +generation. However, when prompted to identify or count (perceive) the entities +in a given image, existing MLLM systems fail. Working towards developing an +accurate MLLM system for perception and reasoning, we propose using Versatile +vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the +VCoder with perception modalities such as segmentation or depth maps, improving +the MLLM's perception abilities. Secondly, we leverage the images from COCO and +outputs from off-the-shelf vision perception models to create our COCO +Segmentation Text (COST) dataset for training and evaluating MLLMs on the +object perception task. Thirdly, we introduce metrics to assess the object +perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive +experimental evidence proving the VCoder's improved object-level perception +skills over existing Multimodal LLMs, including GPT-4V. We open-source our +dataset, code, and models to promote research. We open-source our code at +https://github.com/SHI-Labs/VCoder",cs.CV,['cs.CV'] +Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion,Kiran Chhatre · Radek Danecek · Nikos Athanasiou · Giorgio Becherini · Christopher Peters · Michael J. Black · Timo Bolkart,https://amuse.is.tue.mpg.de/,https://arxiv.org/abs/2312.04466,,2312.04466.pdf,Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion,"Existing methods for synthesizing 3D human gestures from speech have shown +promising results, but they do not explicitly model the impact of emotions on +the generated gestures. Instead, these methods directly output animations from +speech without control over the expressed emotion. To address this limitation, +we present AMUSE, an emotional speech-driven body animation model based on +latent diffusion. Our observation is that content (i.e., gestures related to +speech rhythm and word utterances), emotion, and personal style are separable. +To account for this, AMUSE maps the driving audio to three disentangled latent +vectors: one for content, one for emotion, and one for personal style. A latent +diffusion model, trained to generate gesture motion sequences, is then +conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human +gestures directly from speech with control over the expressed emotions and +style by combining the content from the driving speech with the emotion and +style of another speech sequence. Randomly sampling the noise of the diffusion +model further generates variations of the gesture with the same emotional +expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate +that AMUSE outputs realistic gesture sequences. 
Compared to the state of the +art, the generated gestures are better synchronized with the speech content, +and better represent the emotion expressed by the input speech. Our code is +available at amuse.is.tue.mpg.de.",cs.CV,['cs.CV'] +Accept the Modality Gap: An Exploration in the Hyperbolic Space,Sameera Ramasinghe · Violetta Shevchenko · Gil Avraham · Thalaiyasingam Ajanthan, ,,https://openreview.net/forum?id=KiespDPaRH,,,,,nan +Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training,Di Ming · Peng Ren · Yunlong Wang · Xin Feng,https://github.com/MisterRpeng/EGS-TSSA,,https://midasdming.github.io/news/announcement_17/,,,,,nan +DreamComposer: Controllable 3D Object Generation via Multi-View Conditions,Yunhan Yang · Yukun Huang · Xiaoyang Wu · Yuan-Chen Guo · Song-Hai Zhang · Hengshuang Zhao · Tong He · Xihui Liu, ,https://arxiv.org/abs/2312.03611,,2312.03611.pdf,DreamComposer: Controllable 3D Object Generation via Multi-View Conditions,"Utilizing pre-trained 2D large-scale generative models, recent works are +capable of generating high-quality novel views from a single in-the-wild image. +However, due to the lack of information from multiple views, these works +encounter difficulties in generating controllable novel views. In this paper, +we present DreamComposer, a flexible and scalable framework that can enhance +existing view-aware diffusion models by injecting multi-view conditions. +Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain +3D representations of an object from multiple views. Then, it renders the +latent features of the target view from 3D representations with the multi-view +feature fusion module. Finally the target view features extracted from +multi-view inputs are injected into a pre-trained diffusion model. Experiments +show that DreamComposer is compatible with state-of-the-art diffusion models +for zero-shot novel view synthesis, further enhancing them to generate +high-fidelity novel view images with multi-view conditions, ready for +controllable 3D object reconstruction and various other applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Pose Adapted Shape Learning for Large-Pose Face Reenactment,Gee-Sern Hsu · Jie-Ying Zhang · Yu-Hsiang Huang · Wei-Jie Hong, ,,https://ieeexplore.ieee.org/abstract/document/10219601,,,,,nan +LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding,Min Liang · Jia-Wei Ma · Xiaobin Zhu · Jingyan Qin · Xu-Cheng Yin, ,https://ar5iv.labs.arxiv.org/html/2207.12955,,2207.12955.pdf,Contextual Text Block Detection towards Scene Text Understanding,"Most existing scene text detectors focus on detecting characters or words +that only capture partial text messages due to missing contextual information. +For a better understanding of text in scenes, it is more desired to detect +contextual text blocks (CTBs) which consist of one or multiple integral text +units (e.g., characters, words, or phrases) in natural reading order and +transmit certain complete text messages. This paper presents contextual text +detection, a new setup that detects CTBs for better understanding of texts in +scenes. We formulate the new setup by a dual detection task which first detects +integral text units and then groups them into a CTB. To this end, we design a +novel scene text clustering technique that treats integral text units as tokens +and groups them (belonging to the same CTB) into an ordered token sequence. 
In +addition, we create two datasets SCUT-CTW-Context and ReCTS-Context to +facilitate future research, where each CTB is well annotated by an ordered +sequence of integral text units. Further, we introduce three metrics that +measure contextual text detection in local accuracy, continuity, and global +accuracy. Extensive experiments show that our method accurately detects CTBs +which effectively facilitates downstream tasks such as text classification and +translation. The project is available at +https://sg-vilab.github.io/publication/xue2022contextual/.",cs.CV,['cs.CV'] +PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos,Qi Zhao · M. Salman Asif · Zhan Ma, ,https://arxiv.org/abs/2404.08921,,2404.08921.pdf,PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos,"The primary focus of Neural Representation for Videos (NeRV) is to +effectively model its spatiotemporal consistency. However, current NeRV systems +often face a significant issue of spatial inconsistency, leading to decreased +perceptual quality. To address this issue, we introduce the Pyramidal Neural +Representation for Videos (PNeRV), which is built on a multi-scale information +connection and comprises a lightweight rescaling operator, Kronecker +Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The +KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer, +facilitates low-cost rescaling and global correlation modeling. BSM merges +high-level features with granular ones adaptively. Furthermore, we provide an +analysis based on the Universal Approximation Theory of the NeRV system and +validate the effectiveness of the proposed PNeRV.We conducted comprehensive +experiments to demonstrate that PNeRV surpasses the performance of contemporary +NeRV models, achieving the best results in video regression on UVG and DAVIS +under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV, +PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG, along +with a +3.28 dB PSNR and 634% FVD increase on DAVIS.",cs.CV,['cs.CV'] +Bézier Everywhere All at Once: Learning Drivable Lanes as Bézier Graphs,Hugh Blayney · Hanlin Tian · Hamish Scott · Nils Goldbeck · Chess Stetson · Panagiotis Angeloudis, ,,https://screenrant.com/everything-everywhere-all-at-once-real-meaning-explained/,,,,,nan +Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer,Jiwoo Chung · Sangeek Hyun · Jae-Pil Heo,https://jiwoogit.github.io/StyleID_site/,https://arxiv.org/abs/2312.09008,,2312.09008.pdf,Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer,"Despite the impressive generative capabilities of diffusion models, existing +diffusion model-based style transfer methods require inference-stage +optimization (e.g. fine-tuning or textual inversion of style) which is +time-consuming, or fails to leverage the generative ability of large-scale +diffusion models. To address these issues, we introduce a novel artistic style +transfer method based on a pre-trained large-scale diffusion model without any +optimization. Specifically, we manipulate the features of self-attention layers +as the way the cross-attention mechanism works; in the generation process, +substituting the key and value of content with those of style image. 
This +approach provides several desirable characteristics for style transfer +including 1) preservation of content by transferring similar styles into +similar image patches and 2) transfer of style based on similarity of local +texture (e.g. edge) between content and style images. Furthermore, we introduce +query preservation and attention temperature scaling to mitigate the issue of +disruption of original content, and initial latent Adaptive Instance +Normalization (AdaIN) to deal with the disharmonious color (failure to transfer +the colors of style). Our experimental results demonstrate that our proposed +method surpasses state-of-the-art methods in both conventional and +diffusion-based style transfer baselines.",cs.CV,['cs.CV'] +Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes,Diandian Guo · Deng-Ping Fan · Tongyu Lu · Christos Sakaridis · Luc Van Gool,https://github.com/RascalGdd/VPSeg,https://arxiv.org/abs/2401.15261,,2401.15261.pdf,Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes,"The estimation of implicit cross-frame correspondences and the high +computational cost have long been major challenges in video semantic +segmentation (VSS) for driving scenes. Prior works utilize keyframes, feature +propagation, or cross-frame attention to address these issues. By contrast, we +are the first to harness vanishing point (VP) priors for more effective +segmentation. Intuitively, objects near VPs (i.e., away from the vehicle) are +less discernible. Moreover, they tend to move radially away from the VP over +time in the usual case of a forward-facing camera, a straight road, and linear +forward motion of the vehicle. Our novel, efficient network for VSS, named +VPSeg, incorporates two modules that utilize exactly this pair of static and +dynamic VP priors: sparse-to-dense feature mining (DenseVP) and VP-guided +motion fusion (MotionVP). MotionVP employs VP-guided motion estimation to +establish explicit correspondences across frames and help attend to the most +relevant features from neighboring frames, while DenseVP enhances weak dynamic +features in distant regions around VPs. These modules operate within a +context-detail framework, which separates contextual features from +high-resolution local features at different input resolutions to reduce +computational costs. Contextual and local features are integrated through +contextualized motion attention (CMA) for the final prediction. Extensive +experiments on two popular driving segmentation benchmarks, Cityscapes and +ACDC, demonstrate that VPSeg outperforms previous SOTA methods, with only +modest computational overhead.",cs.CV,['cs.CV'] +TransNeXt: Robust Foveal Visual Perception for Vision Transformers,Dai Shi, ,https://arxiv.org/abs/2311.17132,,2311.17132.pdf,TransNeXt: Robust Foveal Visual Perception for Vision Transformers,"Due to the depth degradation effect in residual connections, many efficient +Vision Transformers models that rely on stacking layers for information +exchange often fail to form sufficient information mixing, leading to unnatural +visual perception. To address this issue, in this paper, we propose Aggregated +Attention, a biomimetic design-based token mixer that simulates biological +foveal vision and continuous eye movement while enabling each token on the +feature map to have a global perception. 
Furthermore, we incorporate learnable +tokens that interact with conventional queries and keys, which further +diversifies the generation of affinity matrices beyond merely relying on the +similarity between queries and keys. Our approach does not rely on stacking for +information exchange, thus effectively avoiding depth degradation and achieving +natural visual perception. Additionally, we propose Convolutional GLU, a +channel mixer that bridges the gap between GLU and SE mechanism, which empowers +each token to have channel attention based on its nearest neighbor image +features, enhancing local modeling capability and model robustness. We combine +aggregated attention and convolutional GLU to create a new visual backbone +called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves +state-of-the-art performance across multiple model sizes. At a resolution of +$224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing +ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet +accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of +$384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic +segmentation mIoU of 54.7.",cs.CV,"['cs.CV', 'cs.AI']" +Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects,Yijia Weng · Bowen Wen · Jonathan Tremblay · Valts Blukis · Dieter Fox · Leonidas Guibas · Stan Birchfield,https://nvlabs.github.io/DigitalTwinArt/,https://arxiv.org/abs/2404.01440,,2404.01440.pdf,Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects,"We address the problem of building digital twins of unknown articulated +objects from two RGBD scans of the object at different articulation states. We +decompose the problem into two stages, each addressing distinct aspects. Our +method first reconstructs object-level shape at each state, then recovers the +underlying articulation model including part segmentation and joint +articulations that associate the two states. By explicitly modeling point-level +correspondences and exploiting cues from images, 3D reconstructions, and +kinematics, our method yields more accurate and stable results compared to +prior work. It also handles more than one movable part and does not rely on any +object shape or structure priors. Project page: +https://github.com/NVlabs/DigitalTwinArt",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.RO']" +MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,Xiang Yue · Yuansheng Ni · Kai Zhang · Tianyu Zheng · Ruoqi Liu · Ge Zhang · Samuel Stevens · Dongfu Jiang · Weiming Ren · Yuxuan Sun · Cong Wei · Botao Yu · Ruibin Yuan · Renliang Sun · Ming Yin · Boyuan Zheng · Zhenzhu Yang · Yibo Liu · Wenhao Huang · Huan Sun · Yu Su · Wenhu Chen,https://mmmu-benchmark.github.io/,https://arxiv.org/abs/2311.16502,,2311.16502.pdf,MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,"We introduce MMMU: a new benchmark designed to evaluate multimodal models on +massive multi-discipline tasks demanding college-level subject knowledge and +deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal +questions from college exams, quizzes, and textbooks, covering six core +disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & +Social Science, and Tech & Engineering. 
These questions span 30 subjects and +183 subfields, comprising 30 highly heterogeneous image types, such as charts, +diagrams, maps, tables, music sheets, and chemical structures. Unlike existing +benchmarks, MMMU focuses on advanced perception and reasoning with +domain-specific knowledge, challenging models to perform tasks akin to those +faced by experts. The evaluation of 14 open-source LMMs as well as the +proprietary GPT-4V(ision) and Gemini highlights the substantial challenges +posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve +accuracies of 56% and 59% respectively, indicating significant room for +improvement. We believe MMMU will stimulate the community to build +next-generation multimodal foundation models towards expert artificial general +intelligence.",cs.CL,"['cs.CL', 'cs.AI', 'cs.CV']" +Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models,Takami Sato · Justin Yue · Nanze Chen · Ningfei Wang · Alfred Chen, ,https://arxiv.org/abs/2308.15692,,2308.15692.pdf,Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models,"Denoising probabilistic diffusion models have shown breakthrough performance +to generate more photo-realistic images or human-level illustrations than the +prior models such as GANs. This high image-generation capability has stimulated +the creation of many downstream applications in various areas. However, we find +that this technology is actually a double-edged sword: We identify a new type +of attack, called the Natural Denoising Diffusion (NDD) attack based on the +finding that state-of-the-art deep neural network (DNN) models still hold their +prediction even if we intentionally remove their robust features, which are +essential to the human visual system (HVS), through text prompts. The NDD +attack shows a significantly high capability to generate low-cost, +model-agnostic, and transferable adversarial attacks by exploiting the natural +attack capability in diffusion models. To systematically evaluate the risk of +the NDD attack, we perform a large-scale empirical study with our newly created +dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the +natural attack capability by answering 6 research questions. Through a user +study, we find that it can achieve an 88% detection rate while being stealthy +to 93% of human subjects; we also find that the non-robust features embedded by +diffusion models contribute to the natural attack capability. To confirm the +model-agnostic and transferable attack capability, we perform the NDD attack +against the Tesla Model 3 and find that 73% of the physically printed attacks +can be detected as stop signs. Our hope is that the study and dataset can help +our community be aware of the risks in diffusion models and facilitate further +research toward robust DNN models.",cs.CV,"['cs.CV', 'cs.CR']" +SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting,Hoon Kim · Minje Jang · Wonjun Yoon · Jisoo Lee · Donghyun Na · Sanghyun Woo, ,https://arxiv.org/abs/2402.18848,,2402.18848.pdf,SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting,"We introduce a co-designed approach for human portrait relighting that +combines a physics-guided architecture with a pre-training framework. 
Drawing +on the Cook-Torrance reflectance model, we have meticulously configured the +architecture design to precisely simulate light-surface interactions. +Furthermore, to overcome the limitation of scarce high-quality lightstage data, +we have developed a self-supervised pre-training strategy. This novel +combination of accurate physical modeling and expanded training dataset +establishes a new benchmark in relighting realism.",cs.CV,['cs.CV'] +Context-Aware Integration of Language and Visual References for Natural Language Tracking,Yanyan Shao · Shuting He · Qi Ye · Yuchao Feng · Wenhan Luo · Jiming Chen,https://github.com/twotwo2/QueryNLT,https://arxiv.org/abs/2403.19975,,2403.19975.pdf,Context-Aware Integration of Language and Visual References for Natural Language Tracking,"Tracking by natural language specification (TNL) aims to consistently +localize a target in a video sequence given a linguistic description in the +initial frame. Existing methodologies perform language-based and template-based +matching for target reasoning separately and merge the matching results from +two sources, which suffer from tracking drift when language and visual +templates miss-align with the dynamic target state and ambiguity in the later +merging stage. To tackle the issues, we propose a joint multi-modal tracking +framework with 1) a prompt modulation module to leverage the complementarity +between temporal visual templates and language expressions, enabling precise +and context-aware appearance and linguistic cues, and 2) a unified target +decoding module to integrate the multi-modal reference cues and executes the +integrated queries on the search image to predict the target location in an +end-to-end manner directly. This design ensures spatio-temporal consistency by +leveraging historical visual information and introduces an integrated solution, +generating predictions in a single step. Extensive experiments conducted on +TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed +approach. The results demonstrate competitive performance against +state-of-the-art methods for both tracking and grounding.",cs.CV,['cs.CV'] +Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities,Mingcheng Li · Dingkang Yang · Xiao Zhao · Shuaibing Wang · Yan Wang · Kun Yang · Mingyang Sun · Dongliang Kou · Qian · Lihua Zhang, ,https://arxiv.org/abs/2404.16456,,2404.16456.pdf,Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities,"Multimodal sentiment analysis (MSA) aims to understand human sentiment +through multimodal data. Most MSA efforts are based on the assumption of +modality completeness. However, in real-world applications, some practical +factors cause uncertain modality missingness, which drastically degrades the +model's performance. To this end, we propose a Correlation-decoupled Knowledge +Distillation (CorrKD) framework for the MSA task under uncertain missing +modalities. Specifically, we present a sample-level contrastive distillation +mechanism that transfers comprehensive knowledge containing cross-sample +correlations to reconstruct missing semantics. Moreover, a category-guided +prototype distillation mechanism is introduced to capture cross-category +correlations using category prototypes to align feature distributions and +generate favorable joint representations. 
Eventually, we design a +response-disentangled consistency distillation strategy to optimize the +sentiment decision boundaries of the student network through response +disentanglement and mutual information maximization. Comprehensive experiments +on three datasets indicate that our framework can achieve favorable +improvements compared with several baselines.",cs.CV,['cs.CV'] +Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction,Ruixuan Yu · Jian Sun, ,https://arxiv.org/abs/2308.06564,,2308.06564.pdf,EquiDiff: A Conditional Equivariant Diffusion Model For Trajectory Prediction,"Accurate trajectory prediction is crucial for the safe and efficient +operation of autonomous vehicles. The growing popularity of deep learning has +led to the development of numerous methods for trajectory prediction. While +deterministic deep learning models have been widely used, deep generative +models have gained popularity as they learn data distributions from training +data and account for trajectory uncertainties. In this study, we propose +EquiDiff, a deep generative model for predicting future vehicle trajectories. +EquiDiff is based on the conditional diffusion model, which generates future +trajectories by incorporating historical information and random Gaussian noise. +The backbone model of EquiDiff is an SO(2)-equivariant transformer that fully +utilizes the geometric properties of location coordinates. In addition, we +employ Recurrent Neural Networks and Graph Attention Networks to extract social +interactions from historical trajectories. To evaluate the performance of +EquiDiff, we conduct extensive experiments on the NGSIM dataset. Our results +demonstrate that EquiDiff outperforms other baseline models in short-term +prediction, but has slightly higher errors for long-term prediction. +Furthermore, we conduct an ablation study to investigate the contribution of +each component of EquiDiff to the prediction accuracy. Additionally, we present +a visualization of the generation process of our diffusion model, providing +insights into the uncertainty of the prediction.",cs.LG,"['cs.LG', 'cs.RO']" +SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models,Tongtian Yue · Jie Cheng · Longteng Guo · Xingyuan Dai · Zijia Zhao · Xingjian He · Gang Xiong · Yisheng Lv · Jing Liu, ,https://arxiv.org/abs/2403.13263,,2403.13263.pdf,SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models,"Recent trends in Large Vision Language Models (LVLMs) research have been +increasingly focusing on advancing beyond general image understanding towards +more nuanced, object-level referential comprehension. In this paper, we present +and delve into the self-consistency capability of LVLMs, a crucial aspect that +reflects the models' ability to both generate informative captions for specific +objects and subsequently utilize these captions to accurately re-identify the +objects in a closed-loop process. This capability significantly mirrors the +precision and reliability of fine-grained visual-language understanding. Our +findings reveal that the self-consistency level of existing LVLMs falls short +of expectations, posing limitations on their practical applicability and +potential. To address this gap, we introduce a novel fine-tuning paradigm named +Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a +cyclic describer-locator system. 
This paradigm is not only data-efficient but +also exhibits generalizability across multiple LVLMs. Through extensive +experiments, we demonstrate that SC-Tune significantly elevates performance +across a spectrum of object-level vision-language benchmarks and maintains +competitive or improved performance on image-level vision-language benchmarks. +Both our model and code will be publicly available at +https://github.com/ivattyue/SC-Tune.",cs.CV,['cs.CV'] +Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models,Huimin Huang · Yawen Huang · Lanfen Lin · Ruofeng Tong · Yen-Wei Chen · Hao Zheng · Yuexiang Li · Yefeng Zheng, ,https://arxiv.org/abs/2405.14136,,,Efficient Multitask Dense Predictor via Binarization,"Multi-task learning for dense prediction has emerged as a pivotal area in +computer vision, enabling simultaneous processing of diverse yet interrelated +pixel-wise prediction tasks. However, the substantial computational demands of +state-of-the-art (SoTA) models often limit their widespread deployment. This +paper addresses this challenge by introducing network binarization to compress +resource-intensive multi-task dense predictors. Specifically, our goal is to +significantly accelerate multi-task dense prediction models via Binary Neural +Networks (BNNs) while maintaining and even improving model performance at the +same time. To reach this goal, we propose a Binary Multi-task Dense Predictor, +Bi-MTDP, and several variants of Bi-MTDP, in which a multi-task dense predictor +is constructed via specified binarized modules. Our systematical analysis of +this predictor reveals that performance drop from binarization is primarily +caused by severe information degradation. To address this issue, we introduce a +deep information bottleneck layer that enforces representations for downstream +tasks satisfying Gaussian distribution in forward propagation. Moreover, we +introduce a knowledge distillation mechanism to correct the direction of +information flow in backward propagation. Intriguingly, one variant of Bi-MTDP +outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC +(CNN-based) and InvPT (ViT-Based). This result indicates that Bi-MTDP is not +merely a naive trade-off between performance and efficiency, but is rather a +benefit of the redundant information flow thanks to the multi-task +architecture. Code is available at https://github.com/42Shawn/BiMTDP.",cs.CV,['cs.CV'] +Clustering Propagation for Universal Medical Image Segmentation,Yuhang Ding · Liulei Li · Wenguan Wang · Yi Yang, ,https://arxiv.org/abs/2403.16646,,2403.16646.pdf,Clustering Propagation for Universal Medical Image Segmentation,"Prominent solutions for medical image segmentation are typically tailored for +automatic or interactive setups, posing challenges in facilitating progress +achieved in one task to another. This also +necessitates separate models for each task, duplicating both +training time and parameters. To address above +issues, we introduce S2VNet, a +universal framework that leverages +Slice-to-Volume propagation to unify automatic/interactive +segmentation within a single model and one training session.
Inspired by +clustering-based segmentation techniques, S2VNet makes full use of the +slice-wise structure of volumetric data by initializing cluster centers from +the cluster results of previous slice. This +enables knowledge acquired from prior slices to assist in the segmentation of +the current slice, further efficiently bridging the communication between +remote slices using mere 2D networks. Moreover, such a framework readily +accommodates interactive segmentation with no architectural change, simply by +initializing centroids from user inputs. S2VNet distinguishes itself by swift +inference speeds and reduced memory consumption compared to prevailing 3D +solutions. It can also handle multi-class interactions with each of them +serving to initialize different centroids. Experiments on three benchmarks +demonstrate S2VNet surpasses task-specified solutions on both +automatic/interactive setups.",cs.CV,['cs.CV'] +Tri-Modal Motion Retrieval by Learning a Joint Embedding Space,Kangning Yin · Shihao Zou · Yuxuan Ge · Zheng Tian, ,https://arxiv.org/abs/2403.00691,,2403.00691.pdf,Tri-Modal Motion Retrieval by Learning a Joint Embedding Space,"Information retrieval is an ever-evolving and crucial research domain. The +substantial demand for high-quality human motion data especially in online +acquirement has led to a surge in human motion research works. Prior works have +mainly concentrated on dual-modality learning, such as text and motion tasks, +but three-modality learning has been rarely explored. Intuitively, an extra +introduced modality can enrich a model's application scenario, and more +importantly, an adequate choice of the extra modality can also act as an +intermediary and enhance the alignment between the other two disparate +modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion +alignment), a novel framework for three-modality learning integrating +human-centric videos as an additional modality, thereby effectively bridging +the gap between text and motion. Moreover, our approach leverages a specially +designed attention mechanism to foster enhanced alignment and synergistic +effects among text, video, and motion modalities. Empirically, our results on +the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art +performance in various motion-related cross-modal retrieval tasks, including +text-to-motion, motion-to-text, video-to-motion and motion-to-video.",cs.CV,"['cs.CV', 'cs.AI']" +Rethinking Human Motion Prediction with Symplectic Integral,Haipeng Chen · Kedi Lyu · Zhenguang Liu · Yifang Yin · Xun Yang · Yingda Lyu, ,https://arxiv.org/abs/2312.06184,,2312.06184.pdf,Recent Advances in Deterministic Human Motion Prediction: A Review,"In recent years, with the continuous advancement of deep learning and the +emergence of large-scale human motion datasets, human motion prediction +technology has gradually gained prominence in various fields such as +human-computer interaction, autonomous driving, sports analysis, and personnel +tracking. This article introduces common model architectures in this domain +along with their respective advantages and disadvantages. It also +systematically summarizes recent research innovations, focusing on in-depth +discussions of relevant papers in these areas, thereby highlighting +forward-looking insights into the field's development. Furthermore, this paper +provides a comprehensive overview of existing methods, commonly used datasets, +and evaluation metrics in this field.
Finally, it discusses some of the current +limitations in the field and proposes potential future research directions to +address these challenges and promote further advancements in human motion +prediction.",cs.CV,['cs.CV'] +UniPAD: A Universal Pre-training Paradigm for Autonomous Driving,Honghui Yang · Sha Zhang · Di Huang · Xiaoyang Wu · Haoyi Zhu · Tong He · SHIXIANG TANG · Hengshuang Zhao · Qibo Qiu · Binbin Lin · Xiaofei He · Wanli Ouyang,https://github.com/Nightmare-n/UniPAD,https://arxiv.org/abs/2310.08370,,2310.08370.pdf,UniPAD: A Universal Pre-training Paradigm for Autonomous Driving,"In the context of autonomous driving, the significance of effective feature +learning is widely acknowledged. While conventional 3D self-supervised +pre-training methods have shown widespread success, most methods follow the +ideas originally designed for 2D images. In this paper, we present UniPAD, a +novel self-supervised learning paradigm applying 3D volumetric differentiable +rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction +of continuous 3D shape structures and the intricate appearance characteristics +of their 2D projections. The flexibility of our method enables seamless +integration into both 2D and 3D frameworks, enabling a more holistic +comprehension of the scenes. We manifest the feasibility and effectiveness of +UniPAD by conducting extensive experiments on various downstream 3D tasks. Our +method significantly improves lidar-, camera-, and lidar-camera-based baseline +by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline +achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic +segmentation on the nuScenes validation set, achieving state-of-the-art results +in comparison with previous methods. The code will be available at +https://github.com/Nightmare-n/UniPAD.",cs.CV,['cs.CV'] +Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning,Yutian Luo · Shiqi Zhao · Haoran Wu · Zhiwu Lu, ,https://arxiv.org/abs/2308.09303,,2308.09303.pdf,Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning,"Continual learning aims to learn a model from a continuous stream of data, +but it mainly assumes a fixed number of data and tasks with clear task +boundaries. However, in real-world scenarios, the number of input data and +tasks is constantly changing in a statistical way, not a static way. Although +recently introduced incremental learning scenarios having blurry task +boundaries somewhat address the above issues, they still do not fully reflect +the statistical properties of real-world situations because of the fixed ratio +of disjoint and blurry samples. In this paper, we propose a new Stochastic +incremental Blurry task boundary scenario, called Si-Blurry, which reflects the +stochastic properties of the real-world. We find that there are two major +challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and +(2) class imbalance problem. To alleviate them, we introduce Mask and Visual +Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting +issues, we propose a novel instance-wise logit masking and contrastive visual +prompt tuning loss. Both of them help our model discern the classes to be +learned in the current batch. It results in consolidating the previous +knowledge. 
In addition, to alleviate the class imbalance problem, we introduce +a new gradient similarity-based focal loss and adaptive feature scaling to ease +overfitting to the major classes and underfitting to the minor classes. +Extensive experiments show that our proposed MVP significantly outperforms the +existing state-of-the-art methods in our challenging Si-Blurry scenario.",cs.CV,"['cs.CV', 'cs.LG']" +Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection,Zhiwei Yang · Jing Liu · Peng Wu, ,https://arxiv.org/abs/2404.08531,,2404.08531.pdf,Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection,"Weakly supervised video anomaly detection (WSVAD) is a challenging task. +Generating fine-grained pseudo-labels based on weak-label and then +self-training a classifier is currently a promising solution. However, since +the existing methods use only RGB visual modality and the utilization of +category text information is neglected, thus limiting the generation of more +accurate pseudo-labels and affecting the performance of self-training. Inspired +by the manual labeling process based on the event description, in this paper, +we propose a novel pseudo-label generation and self-training framework based on +Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer +the rich language-visual knowledge of the contrastive language-image +pre-training (CLIP) model for aligning the video event description text and +corresponding video frames to generate pseudo-labels. Specifically, we first +fine-tune the CLIP for domain adaptation by designing two ranking losses and a +distributional inconsistency loss. Further, we propose a learnable text prompt +mechanism with the assist of a normality visual prompt to further improve the +matching accuracy of video event description text and video frames. Then, we +design a pseudo-label generation module based on the normality guidance to +infer reliable frame-level pseudo-labels. Finally, we introduce a temporal +context self-adaptive learning module to learn the temporal dependencies of +different video events more flexibly and accurately. Extensive experiments show +that our method achieves state-of-the-art performance on two benchmark +datasets, UCF-Crime and XD-Violence.",cs.CV,['cs.CV'] +Partial-to-Partial Shape Matching with Geometric Consistency,Viktoria Ehm · Maolin Gao · Paul Roetzer · Marvin Eisenberger · Daniel Cremers · Florian Bernard,https://vikiehm.github.io/publications/gcppsm/,https://arxiv.org/abs/2404.12209,,2404.12209.pdf,Partial-to-Partial Shape Matching with Geometric Consistency,"Finding correspondences between 3D shapes is an important and long-standing +problem in computer vision, graphics and beyond. A prominent challenge are +partial-to-partial shape matching settings, which occur when the shapes to +match are only observed incompletely (e.g. from 3D scanning). Although +partial-to-partial matching is a highly relevant setting in practice, it is +rarely explored. Our work bridges the gap between existing (rather artificial) +3D full shape matching and partial-to-partial real-world settings by exploiting +geometric consistency as a strong constraint. We demonstrate that it is indeed +possible to solve this challenging problem in a variety of settings.
For the +first time, we achieve geometric consistency for partial-to-partial matching, +which is realized by a novel integer non-linear program formalism building on +triangle product spaces, along with a new pruning algorithm based on linear +integer programming. Further, we generate a new inter-class dataset for +partial-to-partial shape-matching. We show that our method outperforms current +SOTA methods on both an established intra-class dataset and our novel +inter-class dataset.",cs.CV,['cs.CV'] +Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss,Jaeha Kim · Junghun Oh · Kyoung Mu Lee, ,https://arxiv.org/abs/2404.01692,,2404.01692.pdf,Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss,"In real-world scenarios, image recognition tasks, such as semantic +segmentation and object detection, often pose greater challenges due to the +lack of information available within low-resolution (LR) content. Image +super-resolution (SR) is one of the promising solutions for addressing the +challenges. However, due to the ill-posed property of SR, it is challenging for +typical SR methods to restore task-relevant high-frequency contents, which may +dilute the advantage of utilizing the SR method. Therefore, in this paper, we +propose Super-Resolution for Image Recognition (SR4IR) that effectively guides +the generation of SR images beneficial to achieving satisfactory image +recognition performance when processing LR images. The critical component of +our SR4IR is the task-driven perceptual (TDP) loss that enables the SR network +to acquire task-specific knowledge from a network tailored for a specific task. +Moreover, we propose a cross-quality patch mix and an alternate training +framework that significantly enhances the efficacy of the TDP loss by +addressing potential problems when employing the TDP loss. Through extensive +experiments, we demonstrate that our SR4IR achieves outstanding task +performance by generating SR images useful for a specific image recognition +task, including semantic segmentation, object detection, and image +classification. The implementation code is available at +https://github.com/JaehaKim97/SR4IR.",cs.CV,['cs.CV'] +FaceCom: Towards High-fidelity 3D Facial Shape Completion via Optimization and Inpainting Guidance,Yinglong Li · Hongyu Wu · Wang · Qingzhao Qin · yijiao zhao · Yong Wang · Aimin Hao, ,https://arxiv.org/abs/2308.16758,,2308.16758.pdf,Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images,"Generating 3D faces from textual descriptions has a multitude of +applications, such as gaming, movie, and robotics. Recent progresses have +demonstrated the success of unconditional 3D face generation and text-to-3D +shape generation. However, due to the limited text-3D face data pairs, +text-driven 3D face generation remains an open problem. In this paper, we +propose a text-guided 3D faces generation method, refer as TG-3DFace, for +generating realistic 3D faces using text guidance. Specifically, we adopt an +unconditional 3D face generation framework and equip it with text conditions, +which learns the text-guided 3D face generation with only text-2D face data. On +top of that, we propose two text-to-face cross-modal alignment techniques, +including the global contrastive learning and the fine-grained alignment +module, to facilitate high semantic consistency between generated 3D faces and +input texts. 
Besides, we present directional classifier guidance during the +inference process, which encourages creativity for out-of-domain generations. +Compared to the existing methods, TG-3DFace creates more realistic and +aesthetically pleasing 3D faces, boosting 9% multi-view consistency (MVIC) over +Latent3D. The rendered face images generated by TG-3DFace achieve higher FID +and CLIP score than text-to-2D face/image generation models, demonstrating our +superiority in generating realistic and semantic-consistent textures.",cs.CV,['cs.CV'] +4K4D: Real-Time 4D View Synthesis at 4K Resolution,Zhen Xu · Sida Peng · Haotong Lin · Guangzhao He · Jiaming Sun · Yujun Shen · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/4k4d,https://arxiv.org/abs/2310.11448,,2310.11448.pdf,4K4D: Real-Time 4D View Synthesis at 4K Resolution,"This paper targets high-fidelity and real-time view synthesis of dynamic 3D +scenes at 4K resolution. Recently, some methods on dynamic view synthesis have +shown impressive rendering quality. However, their speed is still limited when +rendering high-resolution images. To overcome this problem, we propose 4K4D, a +4D point cloud representation that supports hardware rasterization and enables +unprecedented rendering speed. Our representation is built on a 4D feature grid +so that the points are naturally regularized and can be robustly optimized. In +addition, we design a novel hybrid appearance model that significantly boosts +the rendering quality while preserving efficiency. Moreover, we develop a +differentiable depth peeling algorithm to effectively learn the proposed model +from RGB videos. Experiments show that our representation can be rendered at +over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the +ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x +faster than previous methods and achieves the state-of-the-art rendering +quality. Our project page is available at https://zju3dv.github.io/4k4d/.",cs.CV,['cs.CV'] +VILA: On Pre-training for Visual Language Models,Ji Lin · Danny Yin · Wei Ping · Pavlo Molchanov · Mohammad Shoeybi · Song Han,https://github.com/NVlabs/VILA,https://arxiv.org/abs/2312.07533,,,VILA: On Pre-training for Visual Language Models,"Visual language models (VLMs) rapidly progressed with the recent success of +large language models. There have been growing efforts on visual instruction +tuning to extend the LLM with visual inputs, but lacks an in-depth study of the +visual language pre-training process, where the model learns to perform joint +modeling on both modalities. In this work, we examine the design options for +VLM pre-training by augmenting LLM towards VLM through step-by-step +controllable comparisons. We introduce three main findings: (1) freezing LLMs +during pre-training can achieve decent zero-shot performance, but lack +in-context learning capability, which requires unfreezing the LLM; (2) +interleaved pre-training data is beneficial whereas image-text pairs alone are +not optimal; (3) re-blending text-only instruction data to image-text data +during instruction fine-tuning not only remedies the degradation of text-only +tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe +we build VILA, a Visual Language model family that consistently outperforms the +state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells +and whistles. 
Multi-modal pre-training also helps unveil appealing properties +of VILA, including multi-image reasoning, enhanced in-context learning, and +better world knowledge.",cs.CV,['cs.CV'] +GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting,Chi Yan · Delin Qu · Dong Wang · Dan Xu · Zhigang Wang · Bin Zhao · Xuelong Li,https://gs-slam.github.io/,https://arxiv.org/abs/2311.11700,,2311.11700.pdf,GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting,"In this paper, we introduce \textbf{GS-SLAM} that first utilizes 3D Gaussian +representation in the Simultaneous Localization and Mapping (SLAM) system. It +facilitates a better balance between efficiency and accuracy. Compared to +recent SLAM methods employing neural implicit representations, our method +utilizes a real-time differentiable splatting rendering pipeline that offers +significant speedup to map optimization and RGB-D rendering. Specifically, we +propose an adaptive expansion strategy that adds new or deletes noisy 3D +Gaussians in order to efficiently reconstruct new observed scene geometry and +improve the mapping of previously observed areas. This strategy is essential to +extend 3D Gaussian representation to reconstruct the whole scene rather than +synthesize a static object in existing methods. Moreover, in the pose tracking +process, an effective coarse-to-fine technique is designed to select reliable +3D Gaussian representations to optimize camera pose, resulting in runtime +reduction and robust estimation. Our method achieves competitive performance +compared with existing state-of-the-art real-time methods on the Replica, +TUM-RGBD datasets. Project page: https://gs-slam.github.io/.",cs.CV,['cs.CV'] +Generating Content for HDR Deghosting from Frequency View,Tao Hu · Qingsen Yan · Yuankai Qi · Yanning Zhang, ,https://arxiv.org/abs/2404.00849,,2404.00849.pdf,Generating Content for HDR Deghosting from Frequency View,"Recovering ghost-free High Dynamic Range (HDR) images from multiple Low +Dynamic Range (LDR) images becomes challenging when the LDR images exhibit +saturation and significant motion. Recent Diffusion Models (DMs) have been +introduced in HDR imaging field, demonstrating promising performance, +particularly in achieving visually perceptible results compared to previous +DNN-based methods. However, DMs require extensive iterations with large models +to estimate entire images, resulting in inefficiency that hinders their +practical application. To address this challenge, we propose the Low-Frequency +aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of +LF-Diff is implementing the DMs in a highly compacted latent space and +integrating it into a regression-based model to enhance the details of +reconstructed images. Specifically, as low-frequency information is closely +related to human visual perception we propose to utilize DMs to create compact +low-frequency priors for the reconstruction process. In addition, to take full +advantage of the above low-frequency priors, the Dynamic HDR Reconstruction +Network (DHRNet) is carried out in a regression-based manner to obtain final +HDR images. 
Extensive experiments conducted on synthetic and real-world +benchmark datasets demonstrate that our LF-Diff performs favorably against +several state-of-the-art methods and is 10$\times$ faster than previous +DM-based methods.",cs.CV,['cs.CV'] +Neural Sign Actors: A diffusion model for 3D sign language production from text,Vasileios Baltatzis · Rolandos Alexandros Potamias · Evangelos Ververas · Guanxiong Sun · Jiankang Deng · Stefanos Zafeiriou, ,https://arxiv.org/abs/2312.02702,,2312.02702.pdf,Neural Sign Actors: A diffusion model for 3D sign language production from text,"Sign Languages (SL) serve as the primary mode of communication for the Deaf +and Hard of Hearing communities. Deep learning methods for SL recognition and +translation have achieved promising results. However, Sign Language Production +(SLP) poses a challenge as the generated motions must be realistic and have +precise semantic meaning. Most SLP methods rely on 2D data, which hinders their +realism. In this work, a diffusion-based SLP model is trained on a curated +large-scale dataset of 4D signing avatars and their corresponding text +transcripts. The proposed method can generate dynamic sequences of 3D avatars +from an unconstrained domain of discourse using a diffusion process formed on a +novel and anatomically informed graph neural network defined on the SMPL-X body +skeleton. Through quantitative and qualitative experiments, we show that the +proposed method considerably outperforms previous methods of SLP. This work +makes an important step towards realistic neural sign avatars, bridging the +communication gap between Deaf and hearing communities.",cs.CV,['cs.CV'] +Steerers: A framework for rotation equivariant keypoint descriptors,Georg Bökman · Johan Edstedt · Michael Felsberg · Fredrik Kahl, ,https://arxiv.org/abs/2312.02152,,2312.02152.pdf,Steerers: A framework for rotation equivariant keypoint descriptors,"Image keypoint descriptions that are discriminative and matchable over large +changes in viewpoint are vital for 3D reconstruction. However, descriptions +output by learned descriptors are typically not robust to camera rotation. +While they can be made more robust by, e.g., data augmentation, this degrades +performance on upright images. Another approach is test-time augmentation, +which incurs a significant increase in runtime. Instead, we learn a linear +transform in description space that encodes rotations of the input image. We +call this linear transform a steerer since it allows us to transform the +descriptions as if the image was rotated. From representation theory, we know +all possible steerers for the rotation group. Steerers can be optimized (A) +given a fixed descriptor, (B) jointly with a descriptor or (C) we can optimize +a descriptor given a fixed steerer. We perform experiments in these three +settings and obtain state-of-the-art results on the rotation invariant image +matching benchmarks AIMS and Roto-360. We publish code and model weights at +https://github.com/georg-bn/rotation-steerers.",cs.CV,['cs.CV'] +LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning,Siyuan Cheng · Guanhong Tao · Yingqi Liu · Guangyu Shen · Shengwei An · Shiwei Feng · Xiangzhe Xu · Kaiyuan Zhang · Shiqing Ma · Xiangyu Zhang,https://github.com/Megum1/LOTUS,https://arxiv.org/abs/2403.17188,,2403.17188.pdf,LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning,"Backdoor attack poses a significant security threat to Deep Learning +applications. 
Existing attacks are often not evasive to established backdoor +detection techniques. This susceptibility primarily stems from the fact that +these attacks typically leverage a universal trigger pattern or transformation +function, such that the trigger can cause misclassification for any input. In +response to this, recent papers have introduced attacks using sample-specific +invisible triggers crafted through special transformation functions. While +these approaches manage to evade detection to some extent, they reveal +vulnerability to existing backdoor mitigation techniques. To address and +enhance both evasiveness and resilience, we introduce a novel backdoor attack +LOTUS. Specifically, it leverages a secret function to separate samples in the +victim class into a set of partitions and applies unique triggers to different +partitions. Furthermore, LOTUS incorporates an effective trigger focusing +mechanism, ensuring only the trigger corresponding to the partition can induce +the backdoor behavior. Extensive experimental results show that LOTUS can +achieve high attack success rate across 4 datasets and 7 model structures, and +effectively evading 13 backdoor detection and mitigation techniques. The code +is available at https://github.com/Megum1/LOTUS.",cs.CV,"['cs.CV', 'cs.CR']" +Language-only Training of Zero-shot Composed Image Retrieval,Geonmo Gu · Sanghyuk Chun · Wonjae Kim · Yoohoon Kang · Sangdoo Yun,https://github.com/navervision/lincir,https://arxiv.org/abs/2312.01998,,2312.01998.pdf,Language-only Efficient Training of Zero-shot Composed Image Retrieval,"Composed image retrieval (CIR) task takes a composed query of image and text, +aiming to search relative images for both conditions. Conventional CIR +approaches need a training dataset composed of triplets of query image, query +text, and target image, which is very expensive to collect. Several recent +works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue +without using pre-collected triplets. However, the existing ZS-CIR methods show +limited backbone scalability and generalizability due to the lack of diversity +of the input texts during training. We propose a novel CIR framework, only +using language for its training. Our LinCIR (Language-only training for CIR) +can be trained only with text datasets by a novel self-supervision named +self-masking projection (SMP). We project the text latent embedding to the +token embedding space and construct a new text by replacing the keyword tokens +of the original text. Then, we let the new and original texts have the same +latent embedding vector. With this simple strategy, LinCIR is surprisingly +efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in +48 minutes and shows the best ZS-CIR performances on four different CIR +benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming supervised +method on FashionIQ. Code is available at https://github.com/navervision/lincir",cs.CV,"['cs.CV', 'cs.IR']" +"""Previously on ..."" From Recaps to Story Summarization",Aditya Kumar Singh · Dhruv Srivastava · Makarand Tapaswi, ,https://arxiv.org/abs/2405.11487,,2405.11487.pdf,"""Previously on ..."" From Recaps to Story Summarization","We introduce multimodal story summarization by leveraging TV episode recaps - +short video sequences interweaving key story moments from previous episodes to +bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime +thriller TV shows with rich recaps and long episodes of 40 minutes. 
Story +summarization labels are unlocked by matching recap shots to corresponding +sub-stories in the episode. We propose a hierarchical model TaleSumm that +processes entire episodes by creating compact shot and dialog representations, +and predicts importance scores for each video shot and dialog utterance by +enabling interactions between local story groups. Unlike traditional +summarization, our method extracts multiple plot points from long videos. We +present a thorough evaluation on story summarization, including promising +cross-series generalization. TaleSumm also shows good results on classic video +summarization benchmarks.",cs.CV,['cs.CV'] +Learning Equi-angular Representations for Online Continual Learning,Minhyuk Seo · Hyunseo Koh · Wonje Jeung · Minjae Lee · San Kim · Hankook Lee · Sungjun Cho · Sungik Choi · Hyunwoo Kim · Jonghyun Choi, ,https://arxiv.org/abs/2404.01628,,2404.01628.pdf,Learning Equi-angular Representations for Online Continual Learning,"Online continual learning suffers from an underfitted solution due to +insufficient training for prompt model update (e.g., single-epoch training). To +address the challenge, we propose an efficient online continual learning method +using the neural collapse phenomenon. In particular, we induce neural collapse +to form a simplex equiangular tight frame (ETF) structure in the representation +space so that the continuously learned model with a single epoch can better fit +to the streamed data by proposing preparatory data training and residual +correction in the representation space. With an extensive set of empirical +validations using CIFAR-10/100, TinyImageNet, ImageNet-200, and ImageNet-1K, we +show that our proposed method outperforms state-of-the-art methods by a +noticeable margin in various online continual learning scenarios such as +disjoint and Gaussian scheduled continuous (i.e., boundary-free) data setups.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Holodeck: Language Guided Generation of 3D Embodied AI Environments,Yue Yang · Fan-Yun Sun · Luca Weihs · Eli VanderBilt · Alvaro Herrasti · Winson Han · Jiajun Wu · Nick Haber · Ranjay Krishna · Lingjie Liu · Chris Callison-Burch · Mark Yatskar · Aniruddha Kembhavi · Christopher Clark,https://yueyang1996.github.io/holodeck/,https://arxiv.org/abs/2312.09067,,2312.09067.pdf,Holodeck: Language Guided Generation of 3D Embodied AI Environments,"3D simulated environments play a critical role in Embodied AI, but their +creation requires expertise and extensive manual effort, restricting their +diversity and scope. To mitigate this limitation, we present Holodeck, a system +that generates 3D environments to match a user-supplied prompt fully +automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and +museums, adjust the designs for styles, and can capture the semantics of +complex queries such as ""apartment for a researcher with a cat"" and ""office of +a professor who is a fan of Star Wars"". Holodeck leverages a large language +model (i.e., GPT-4) for common sense knowledge about what the scene might look +like and uses a large collection of 3D assets from Objaverse to populate the +scene with diverse objects. To address the challenge of positioning objects +correctly, we prompt GPT-4 to generate spatial relational constraints between +objects and then optimize the layout to satisfy those constraints. 
Our +large-scale human evaluation shows that annotators prefer Holodeck over +manually designed procedural baselines in residential scenes and that Holodeck +can produce high-quality outputs for diverse scene types. We also demonstrate +an exciting application of Holodeck in Embodied AI, training agents to navigate +in novel scenes like music rooms and daycares without human-constructed data, +which is a significant step forward in developing general-purpose embodied +agents.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.RO']" +NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation,Jiahao Chen · Yipeng Qin · Lingjie Liu · Jiangbo Lu · Guanbin Li, ,https://arxiv.org/abs/2403.17537,,2403.17537.pdf,NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation,"Neural Radiance Field (NeRF) has been widely recognized for its excellence in +novel view synthesis and 3D scene reconstruction. However, their effectiveness +is inherently tied to the assumption of static scenes, rendering them +susceptible to undesirable artifacts when confronted with transient distractors +such as moving objects or shadows. In this work, we propose a novel paradigm, +namely ""Heuristics-Guided Segmentation"" (HuGS), which significantly enhances +the separation of static scenes from transient distractors by harmoniously +combining the strengths of hand-crafted heuristics and state-of-the-art +segmentation models, thus significantly transcending the limitations of +previous solutions. Furthermore, we delve into the meticulous design of +heuristics, introducing a seamless fusion of Structure-from-Motion (SfM)-based +heuristics and color residual heuristics, catering to a diverse range of +texture profiles. Extensive experiments demonstrate the superiority and +robustness of our method in mitigating transient distractors for NeRFs trained +in non-static scenes. Project page: https://cnhaox.github.io/NeRF-HuGS/.",cs.CV,['cs.CV'] +FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning,Junyuan Zhang · Shuang Zeng · Miao Zhang · Runxi Wang · Feifei Wang · Yuyin Zhou · Paul Pu Liang · Liangqiong Qu,https://carkham.github.io/FL_Het_Bench/,https://arxiv.org/abs/2306.05172,,2306.05172.pdf,FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems,"Federated Machine Learning (FL) has received considerable attention in recent +years. FL benchmarks are predominantly explored in either simulated systems or +data center environments, neglecting the setups of real-world systems, which +are often closely linked to edge computing. We close this research gap by +introducing FLEdge, a benchmark targeting FL workloads in edge computing +systems. We systematically study hardware heterogeneity, energy efficiency +during training, and the effect of various differential privacy levels on +training in FL systems. To make this benchmark applicable to real-world +scenarios, we evaluate the impact of client dropouts on state-of-the-art FL +strategies with failure rates as high as 50%. 
FLEdge provides new insights, +such as that training state-of-the-art FL workloads on older GPU-accelerated +embedded devices is up to 3x more energy efficient than on modern server-grade +GPUs.",cs.LG,"['cs.LG', 'cs.DC', 'I.2.11; C.2.4; C.4; D.2.8']" +Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics,Xingtao Wang · Hongliang Wei · Xiaopeng Fan · Debin Zhao, ,https://arxiv.org/abs/2405.06536,,2405.06536.pdf,Mesh Denoising Transformer,"Mesh denoising, aimed at removing noise from input meshes while preserving +their feature structures, is a practical yet challenging task. Despite the +remarkable progress in learning-based mesh denoising methodologies in recent +years, their network designs often encounter two principal drawbacks: a +dependence on single-modal geometric representations, which fall short in +capturing the multifaceted attributes of meshes, and a lack of effective global +feature aggregation, hindering their ability to fully understand the mesh's +comprehensive structure. To tackle these issues, we propose SurfaceFormer, a +pioneering Transformer-based mesh denoising framework. Our first contribution +is the development of a new representation known as Local Surface Descriptor, +which is crafted by establishing polar systems on each mesh face, followed by +sampling points from adjacent surfaces using geodesics. The normals of these +points are organized into 2D patches, mimicking images to capture local +geometric intricacies, whereas the poles and vertex coordinates are +consolidated into a point cloud to embody spatial information. This advancement +surmounts the hurdles posed by the irregular and non-Euclidean characteristics +of mesh data, facilitating a smooth integration with Transformer architecture. +Next, we propose a dual-stream structure consisting of a Geometric Encoder +branch and a Spatial Encoder branch, which jointly encode local geometry +details and spatial information to fully explore multimodal information for +mesh denoising. A subsequent Denoising Transformer module receives the +multimodal information and achieves efficient global feature aggregation +through self-attention operators. Our experimental evaluations demonstrate that +this novel approach outperforms existing state-of-the-art methods in both +objective and subjective assessments, marking a significant leap forward in +mesh denoising.",cs.CV,['cs.CV'] +Boosting Diffusion Models with Moving Average Sampling in Frequency Domain,Yurui Qian · Qi Cai · Yingwei Pan · Yehao Li · Ting Yao · Qibin Sun · Tao Mei, ,https://arxiv.org/abs/2403.17870,,2403.17870.pdf,Boosting Diffusion Models with Moving Average Sampling in Frequency Domain,"Diffusion models have recently brought a powerful revolution in image +generation. Despite showing impressive generative capabilities, most of these +models rely on the current sample to denoise the next one, possibly resulting +in denoising instability. In this paper, we reinterpret the iterative denoising +process as model optimization and leverage a moving average mechanism to +ensemble all the prior samples. Instead of simply applying moving average to +the denoised samples at different timesteps, we first map the denoised samples +to data space and then perform moving average to avoid distribution shift +across timesteps. 
In view that diffusion models evolve the recovery from +low-frequency components to high-frequency details, we further decompose the +samples into different frequency components and execute moving average +separately on each component. We name the complete approach ""Moving Average +Sampling in Frequency domain (MASF)"". MASF could be seamlessly integrated into +mainstream pre-trained diffusion models and sampling schedules. Extensive +experiments on both unconditional and conditional diffusion models demonstrate +that our MASF leads to superior performances compared to the baselines, with +almost negligible additional complexity cost.",cs.CV,"['cs.CV', 'cs.MM']" +Task-Aware Encoder Control for Deep Video Compression,Xingtong Ge · Jixiang Luo · XINJIE ZHANG · Tongda Xu · Guo Lu · Dailan He · Jing Geng · Yan Wang · Jun Zhang · Hongwei Qin, ,https://arxiv.org/abs/2404.04848,,2404.04848.pdf,Task-Aware Encoder Control for Deep Video Compression,"Prior research on deep video compression (DVC) for machine tasks typically +necessitates training a unique codec for each specific task, mandating a +dedicated decoder per task. In contrast, traditional video codecs employ a +flexible encoder controller, enabling the adaptation of a single codec to +different tasks through mechanisms like mode prediction. Drawing inspiration +from this, we introduce an innovative encoder controller for deep video +compression for machines. This controller features a mode prediction and a +Group of Pictures (GoP) selection module. Our approach centralizes control at +the encoding stage, allowing for adaptable encoder adjustments across different +tasks, such as detection and tracking, while maintaining compatibility with a +standard pre-trained DVC decoder. Empirical evidence demonstrates that our +method is applicable across multiple tasks with various existing pre-trained +DVCs. Moreover, extensive experiments demonstrate that our method outperforms +previous DVC by about 25% bitrate for different tasks, with only one +pre-trained decoder.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +NEAT: Distilling 3D Wireframes from Neural Attraction Fields,Nan Xue · Bin Tan · Yuxi Xiao · Liang Dong · Gui-Song Xia · Tianfu Wu · Yujun Shen,https://github.com/cherubicXN/neat,https://arxiv.org/abs/2307.10206,,2307.10206.pdf,NEAT: Distilling 3D Wireframes from Neural Attraction Fields,"This paper studies the problem of structured 3D reconstruction using +wireframes that consist of line segments and junctions, focusing on the +computation of structured boundary geometries of scenes. Instead of leveraging +matching-based solutions from 2D wireframes (or line segments) for 3D wireframe +reconstruction as done in prior arts, we present NEAT, a rendering-distilling +formulation using neural fields to represent 3D line segments with 2D +observations, and bipartite matching for perceiving and distilling of a sparse +set of 3D global junctions. The proposed {NEAT} enjoys the joint optimization +of the neural fields and the global junctions from scratch, using +view-dependent 2D observations without precomputed cross-view feature matching. +Comprehensive experiments on the DTU and BlendedMVS datasets demonstrate our +NEAT's superiority over state-of-the-art alternatives for 3D wireframe +reconstruction. Moreover, the distilled 3D global junctions by NEAT, are a +better initialization than SfM points, for the recently-emerged 3D Gaussian +Splatting for high-fidelity novel view synthesis using about 20 times fewer +initial 3D points. 
Project page: \url{https://xuenan.net/neat}.",cs.CV,"['cs.CV', 'cs.GR']" +Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval,Jiamian Wang · Guohao Sun · Pichao Wang · Dongfang Liu · Sohail Dianat · MAJID RABBANI · Raghuveer Rao · ZHIQIANG TAO, ,https://arxiv.org/abs/2403.17998,,2403.17998.pdf,Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval,"The increasing prevalence of video clips has sparked growing interest in +text-video retrieval. Recent advances focus on establishing a joint embedding +space for text and video, relying on consistent embedding representations to +compute similarity. However, the text content in existing datasets is generally +short and concise, making it hard to fully describe the redundant semantics of +a video. Correspondingly, a single text embedding may be less expressive to +capture the video embedding and empower the retrieval. In this study, we +propose a new stochastic text modeling method T-MASS, i.e., text is modeled as +a stochastic embedding, to enrich text embedding with a flexible and resilient +semantic range, yielding a text mass. To be specific, we introduce a +similarity-aware radius module to adapt the scale of the text mass upon the +given text-video pairs. Plus, we design and develop a support text +regularization to further control the text mass during the training. The +inference pipeline is also tailored to fully exploit the text mass for accurate +retrieval. Empirical evidence suggests that T-MASS not only effectively +attracts relevant text-video pairs while distancing irrelevant ones, but also +enables the determination of precise text embeddings for relevant pairs. Our +experimental results show a substantial improvement of T-MASS over baseline (3% +to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five +benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.",cs.CV,['cs.CV'] +Optimizing Diffusion Noise Can Serve As Universal Motion Priors,Korrawe Karunratanakul · Konpat Preechakul · Emre Aksan · Thabo Beeler · Supasorn Suwajanakorn · Siyu Tang,https://korrawe.github.io/dno-project/,https://arxiv.org/abs/2312.11994v1,,2312.11994v1.pdf,Optimizing Diffusion Noise Can Serve As Universal Motion Priors,"We propose Diffusion Noise Optimization (DNO), a new method that effectively +leverages existing motion diffusion models as motion priors for a wide range of +motion-related tasks. Instead of training a task-specific diffusion model for +each new task, DNO operates by optimizing the diffusion latent noise of an +existing pre-trained text-to-motion model. Given the corresponding latent noise +of a human motion, it propagates the gradient from the target criteria defined +on the motion space through the whole denoising process to update the diffusion +latent noise. As a result, DNO supports any use cases where criteria can be +defined as a function of motion. In particular, we show that, for motion +editing and control, DNO outperforms existing methods in both achieving the +objective and preserving the motion content. DNO accommodates a diverse range +of editing modes, including changing trajectory, pose, joint locations, or +avoiding newly added obstacles. In addition, DNO is effective in motion +denoising and completion, producing smooth and realistic motion from noisy and +partial inputs. 
DNO achieves these results at inference time without the need +for model retraining, offering great versatility for any defined reward or loss +function on the motion representation.",cs.CV,['cs.CV'] +Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation,Ziyang Chen · Yongsheng Pan · Yiwen Ye · Mengkang Lu · Yong Xia,https://github.com/Chen-Ziyang/VPTTA,https://arxiv.org/abs/2311.18363,,2311.18363.pdf,Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation,"Distribution shift widely exists in medical images acquired from different +medical centres and poses a significant obstacle to deploying the pre-trained +semantic segmentation model in real-world applications. Test-time adaptation +has proven its effectiveness in tackling the cross-domain distribution shift +during inference. However, most existing methods achieve adaptation by updating +the pre-trained models, rendering them susceptible to error accumulation and +catastrophic forgetting when encountering a series of distribution shifts +(i.e., under the continual test-time adaptation setup). To overcome these +challenges caused by updating the models, in this paper, we freeze the +pre-trained model and propose the Visual Prompt-based Test-Time Adaptation +(VPTTA) method to train a specific prompt for each test image to align the +statistics in the batch normalization layers. Specifically, we present the +low-frequency prompt, which is lightweight with only a few parameters and can +be effectively trained in a single iteration. To enhance prompt initialization, +we equip VPTTA with a memory bank to benefit the current prompt from previous +ones. Additionally, we design a warm-up mechanism, which mixes source and +target statistics to construct warm-up statistics, thereby facilitating the +training process. Extensive experiments demonstrate the superiority of our +VPTTA over other state-of-the-art methods on two medical image segmentation +benchmark tasks. The code and weights of pre-trained source models are +available at https://github.com/Chen-Ziyang/VPTTA.",cs.CV,['cs.CV'] +A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning,Xiaoyang Xu · Mengda Yang · Wenzhe Yi · Ziang Li · Juan Wang · Hongxin Hu · Yong ZHUANG · Yaxin Liu, ,https://arxiv.org/abs/2405.04115,,2405.04115.pdf,A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning,"Split Learning (SL) is a distributed learning framework renowned for its +privacy-preserving features and minimal computational requirements. Previous +research consistently highlights the potential privacy breaches in SL systems +by server adversaries reconstructing training data. However, these studies +often rely on strong assumptions or compromise system utility to enhance attack +performance. This paper introduces a new semi-honest Data Reconstruction Attack +on SL, named Feature-Oriented Reconstruction Attack (FORA). In contrast to +prior works, FORA relies on limited prior knowledge, specifically that the +server utilizes auxiliary samples from the public without knowing any client's +private information. This allows FORA to conduct the attack stealthily and +achieve robust performance. The key vulnerability exploited by FORA is the +revelation of the model representation preference in the smashed data output by +victim client. 
FORA constructs a substitute client through feature-level +transfer learning, aiming to closely mimic the victim client's representation +preference. Leveraging this substitute client, the server trains the attack +model to effectively reconstruct private data. Extensive experiments showcase +FORA's superior performance compared to state-of-the-art methods. Furthermore, +the paper systematically evaluates the proposed method's applicability across +diverse settings and advanced defense strategies.",cs.CR,['cs.CR'] +Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models,David Stotko · Nils Wandel · Reinhard Klein,https://cg.cs.uni-bonn.de/publication/stotko2024-Physics-guided-SfT,https://arxiv.org/abs/2311.12796,,2311.12796.pdf,Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models,"3D reconstruction of dynamic scenes is a long-standing problem in computer +graphics and increasingly difficult the less information is available. +Shape-from-Template (SfT) methods aim to reconstruct a template-based geometry +from RGB images or video sequences, often leveraging just a single monocular +camera without depth information, such as regular smartphone recordings. +Unfortunately, existing reconstruction methods are either unphysical and noisy +or slow in optimization. To solve this problem, we propose a novel SfT +reconstruction algorithm for cloth using a pre-trained neural surrogate model +that is fast to evaluate, stable, and produces smooth reconstructions due to a +regularizing physics simulation. Differentiable rendering of the simulated mesh +enables pixel-wise comparisons between the reconstruction and a target video +sequence that can be used for a gradient-based optimization procedure to +extract not only shape information but also physical parameters such as +stretching, shearing, or bending stiffness of the cloth. This allows to retain +a precise, stable, and smooth reconstructed geometry while reducing the runtime +by a factor of 400-500 compared to $\phi$-SfT, a state-of-the-art physics-based +SfT approach.",cs.CV,"['cs.CV', 'cs.LG']" +Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition,Anqi Zhu · Qiuhong Ke · Mingming Gong · James Bailey, ,https://arxiv.org/abs/2404.07487,,2404.07487.pdf,Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition,"Skeleton-based zero-shot action recognition aims to recognize unknown human +actions based on the learned priors of the known skeleton-based actions and a +semantic descriptor space shared by both known and unknown categories. However, +previous works focus on establishing the bridges between the known skeleton +representation space and semantic descriptions space at the coarse-grained +level for recognizing unknown action categories, ignoring the fine-grained +alignment of these two spaces, resulting in suboptimal performance in +distinguishing high-similarity action categories. To address these challenges, +we propose a novel method via Side information and dual-prompts learning for +skeleton-based zero-shot action recognition (STAR) at the fine-grained level. 
+Specifically, 1) we decompose the skeleton into several parts based on its +topology structure and introduce the side information concerning multi-part +descriptions of human body movements for alignment between the skeleton and the +semantic space at the fine-grained level; 2) we design the visual-attribute and +semantic-part prompts to improve the intra-class compactness within the +skeleton space and inter-class separability within the semantic space, +respectively, to distinguish the high-similarity actions. Extensive experiments +show that our method achieves state-of-the-art performance in ZSL and GZSL +settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.",cs.CV,['cs.CV'] +MICap: A Unified Model for Identity-aware Movie Descriptions,Haran Raajesh · Naveen Reddy Desanur · Zeeshan Khan · Makarand Tapaswi, ,https://arxiv.org/abs/2405.11483,,2405.11483.pdf,MICap: A Unified Model for Identity-aware Movie Descriptions,"Characters are an important aspect of any storyline and identifying and +including them in descriptions is necessary for story understanding. While +previous work has largely ignored identity and generated captions with someone +(anonymized names), recent work formulates id-aware captioning as a +fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is +to predict person id labels. However, to predict captions with ids, a two-stage +approach is required: first predict captions with someone, then fill in +identities. In this work, we present a new single stage approach that can +seamlessly switch between id-aware caption generation or FITB when given a +caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared +auto-regressive decoder that benefits from training with FITB and full-caption +generation objectives, while the encoder can benefit from or disregard captions +with blanks as input. Another challenge with id-aware captioning is the lack of +a metric to capture subtle differences between person ids. To this end, we +introduce iSPICE, a caption evaluation metric that focuses on identity tuples +created through intermediate scene graphs. We evaluate MICap on Large-Scale +Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB +accuracy, and a 1-2% bump in classic captioning metrics.",cs.CV,['cs.CV'] +DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,Lewei Yao · Renjie Pi · Jianhua Han · Xiaodan Liang · Hang Xu · Wei Zhang · Zhenguo Li · Dan Xu, ,https://arxiv.org/abs/2404.09216,,,DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,"Existing open-vocabulary object detectors typically require a predefined set +of categories from users, significantly confining their application scenarios. +In this paper, we introduce DetCLIPv3, a high-performing detector that excels +not only at both open-vocabulary object detection, but also generating +hierarchical labels for detected objects. DetCLIPv3 is characterized by three +core designs: 1. Versatile model architecture: we derive a robust open-set +detection framework which is further empowered with generation ability via the +integration of a caption head. 2. High information density data: we develop an +auto-annotation pipeline leveraging visual large language model to refine +captions for large-scale image-text pairs, providing rich, multi-granular +object labels to enhance the training. 3. 
Efficient training strategy: we +employ a pre-training stage with low-resolution inputs that enables the object +captioner to efficiently learn a broad spectrum of visual concepts from +extensive image-text paired data. This is followed by a fine-tuning stage that +leverages a small number of high-resolution samples to further enhance +detection performance. With these effective designs, DetCLIPv3 demonstrates +superior open-vocabulary detection performance, \eg, our Swin-T backbone model +achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, +outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, +respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense +captioning task on VG dataset, showcasing its strong generative capability.",cs.CV,['cs.CV'] +Label Propagation for Zero-shot Classification with Vision-Language Models,Vladan Stojnić · Yannis Kalantidis · Giorgos Tolias,https://github.com/vladan-stojnic/ZLaP,https://arxiv.org/abs/2404.04072,,2404.04072.pdf,Label Propagation for Zero-shot Classification with Vision-Language Models,"Vision-Language Models (VLMs) have demonstrated impressive performance on +zero-shot classification, i.e. classification when provided merely with a list +of class names. In this paper, we tackle the case of zero-shot classification +in the presence of unlabeled data. We leverage the graph structure of the +unlabeled data and introduce ZLaP, a method based on label propagation (LP) +that utilizes geodesic distances for classification. We tailor LP to graphs +containing both text and image features and further propose an efficient method +for performing inductive inference based on a dual solution and a +sparsification step. We perform extensive experiments to evaluate the +effectiveness of our method on 14 common datasets and show that ZLaP +outperforms the latest related works. Code: +https://github.com/vladan-stojnic/ZLaP",cs.CV,"['cs.CV', 'cs.LG']" +KVQ: Kwai Video Quality Assessment for Short-form Videos,Yiting Lu · Xin Li · Yajing Pei · Kun Yuan · Qizhi Xie · Yunpeng Qu · Ming Sun · Chao Zhou · Zhibo Chen,https://github.com/lixinustc/KVQ-Challenge-CVPR-NTIRE2024,https://arxiv.org/abs/2402.07220,,2402.07220.pdf,KVQ: Kwai Video Quality Assessment for Short-form Videos,"Short-form UGC video platforms, like Kwai and TikTok, have been an emerging +and irreplaceable mainstream media form, thriving on user-friendly engagement, +and kaleidoscope creation, etc. However, the advancing content-generation +modes, e.g., special effects, and sophisticated processing workflows, e.g., +de-artifacts, have introduced significant challenges to recent UGC video +quality assessment: (i) the ambiguous contents hinder the identification of +quality-determined regions. (ii) the diverse and complicated hybrid distortions +are hard to distinguish. To tackle the above challenges and assist in the +development of short-form videos, we establish the first large-scale +Kaleidoscope short Video database for Quality assessment, termed KVQ, which +comprises 600 user-uploaded short videos and 3600 processed videos through the +diverse practical processing workflows, including pre-processing, transcoding, +and enhancement. Among them, the absolute quality score of each video and +partial ranking score among indistinguishable samples are provided by a team of +professional researchers specializing in image processing. 
Based on this +database, we propose the first short-form video quality evaluator, i.e., KSVQE, +which enables the quality evaluator to identify the quality-determined +semantics with the content understanding of large vision language models (i.e., +CLIP) and distinguish the distortions with the distortion understanding module. +Experimental results have shown the effectiveness of KSVQE on our KVQ database +and popular VQA databases.",eess.IV,"['eess.IV', 'cs.CV']" +StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation,Yining Shi · Kun JIANG · Ke Wang · Jiusi Li · Yunlong Wang · Mengmeng Yang · Diange Yang, ,,https://github.com/keithAND2020/awesome-Occupancy-research,,,,,nan +HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion,Jingbo Zhang · Xiaoyu Li · Qi Zhang · Yan-Pei Cao · Ying Shan · Jing Liao, ,https://arxiv.org/abs/2311.16961v1,,2311.16961v1.pdf,HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion,"Generating a 3D human model from a single reference image is challenging +because it requires inferring textures and geometries in invisible views while +maintaining consistency with the reference image. Previous methods utilizing 3D +generative models are limited by the availability of 3D training data. +Optimization-based methods that lift text-to-image diffusion models to 3D +generation often fail to preserve the texture details of the reference image, +resulting in inconsistent appearances in different views. In this paper, we +propose HumanRef, a 3D human generation framework from a single-view input. To +ensure the generated 3D model is photorealistic and consistent with the input +image, HumanRef introduces a novel method called reference-guided score +distillation sampling (Ref-SDS), which effectively incorporates image guidance +into the generation process. Furthermore, we introduce region-aware attention +to Ref-SDS, ensuring accurate correspondence between different body regions. +Experimental results demonstrate that HumanRef outperforms state-of-the-art +methods in generating 3D clothed humans with fine geometry, photorealistic +textures, and view-consistent appearances.",cs.CV,['cs.CV'] +SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,Xin Guo · Jiangwei Lao · Bo Dang · Yingying Zhang · Lei Yu · Lixiang Ru · Liheng Zhong · Ziyuan Huang · Kang Wu · Dingxiang Hu · HUIMEI HE · Jian Wang · Jingdong Chen · Ming Yang · Yongjun Zhang · Yansheng Li, ,https://arxiv.org/abs/2312.10115,,2312.10115.pdf,SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,"Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense +potential towards a generic model for Earth Observation. Nevertheless, these +works primarily focus on a single modality without temporal and geo-context +modeling, hampering their capabilities for diverse tasks. In this study, we +present SkySense, a generic billion-scale model, pre-trained on a curated +multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal +sequences. SkySense incorporates a factorized multi-modal spatiotemporal +encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) +data as input. This encoder is pre-trained by our proposed Multi-Granularity +Contrastive Learning to learn representations across different modal and +spatial granularities. 
To further enhance the RSI representations by the +geo-context clue, we introduce Geo-Context Prototype Learning to learn +region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our +best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules +can be flexibly combined or used individually to accommodate various tasks. It +demonstrates remarkable generalization capabilities on a thorough evaluation +encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to +temporal, and classification to localization. SkySense surpasses 18 recent +RSFMs in all test scenarios. Specifically, it outperforms the latest models +such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and +3.61% on average respectively. We will release the pre-trained weights to +facilitate future research and Earth Observation applications.",cs.CV,['cs.CV'] +BioCLIP: A Vision Foundation Model for the Tree of Life,Samuel Stevens · Jiaman Wu · Matthew Thompson · Elizabeth Campolongo · Chan Hee Song · David Carlyn · Li Dong · Wasila Dahdul · Charles Stewart · Tanya Berger-Wolf · Wei-Lun Chao · Yu Su, ,https://arxiv.org/abs/2311.18803,,2311.18803.pdf,BioCLIP: A Vision Foundation Model for the Tree of Life,"Images of the natural world, collected by a variety of cameras, from drones +to individual phones, are increasingly abundant sources of biological +information. There is an explosion of computational methods and tools, +particularly computer vision, for extracting biologically relevant information +from images for science and conservation. Yet most of these are bespoke +approaches designed for a specific task and are not easily adaptable or +extendable to new questions, contexts, and datasets. A vision model for general +organismal biology questions on images is of timely need. To approach this, we +curate and release TreeOfLife-10M, the largest and most diverse ML-ready +dataset of biology images. We then develop BioCLIP, a foundation model for the +tree of life, leveraging the unique properties of biology captured by +TreeOfLife-10M, namely the abundance and variety of images of plants, animals, +and fungi, together with the availability of rich structured biological +knowledge. We rigorously benchmark our approach on diverse fine-grained biology +classification tasks and find that BioCLIP consistently and substantially +outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation +reveals that BioCLIP has learned a hierarchical representation conforming to +the tree of life, shedding light on its strong generalizability. +https://imageomics.github.io/bioclip has models, data and code.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +MAFA: Managing False Negatives for Vision-Language Pre-training,Jaeseok Byun · Dohoon Kim · Taesup Moon, ,https://arxiv.org/abs/2312.06112,,2312.06112.pdf,Converting and Smoothing False Negatives for Vision-Language Pre-training,"We consider the critical issue of false negatives in Vision-Language +Pre-training (VLP), a challenge that arises from the inherent many-to-many +correspondence of image-text pairs in large-scale web-crawled datasets. The +presence of false negatives can impede achieving optimal performance and even +lead to learning failures. To address this challenge, we propose a method +called COSMO (COnverting and SMOoothing false negatives) that manages the false +negative issues, especially powerful in hard negative sampling. 
Building upon +the recently developed GRouped mIni-baTch sampling (GRIT) strategy, our +approach consists of two pivotal components: 1) an efficient connection mining +process that identifies and converts false negatives into positives, and 2) +label smoothing for the image-text contrastive loss (ITC). Our comprehensive +experiments verify the effectiveness of COSMO across multiple downstream tasks, +emphasizing the crucial role of addressing false negatives in VLP, potentially +even surpassing the importance of addressing false positives. In addition, the +compatibility of COSMO with the recent BLIP-family model is also demonstrated.",cs.CV,"['cs.CV', 'cs.AI']" +General Object Foundation Model for Images and Videos at Scale,Junfeng Wu · Yi Jiang · Qihao Liu · Zehuan Yuan · Xiang Bai · Song Bai,https://glee-vision.github.io/,https://arxiv.org/abs/2312.09158,,2312.09158.pdf,General Object Foundation Model for Images and Videos at Scale,"We present GLEE in this work, an object-level foundation model for locating +and identifying objects in images and videos. Through a unified framework, GLEE +accomplishes detection, segmentation, tracking, grounding, and identification +of arbitrary objects in the open world scenario for various object perception +tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from +diverse data sources with varying supervision levels to formulate general +object representations, excelling in zero-shot transfer to new data and tasks. +Specifically, we employ an image encoder, text encoder, and visual prompter to +handle multi-modal inputs, enabling to simultaneously solve various +object-centric downstream tasks while maintaining state-of-the-art performance. +Demonstrated through extensive training on over five million images from +diverse benchmarks, GLEE exhibits remarkable versatility and improved +generalization performance, efficiently tackling downstream tasks without the +need for task-specific adaptation. By integrating large volumes of +automatically labeled data, we further enhance its zero-shot generalization +capabilities. Additionally, GLEE is capable of being integrated into Large +Language Models, serving as a foundational model to provide universal +object-level information for multi-modal tasks. We hope that the versatility +and universality of our method will mark a significant step in the development +of efficient visual foundation models for AGI systems. The model and code will +be released at https://glee-vision.github.io .",cs.CV,['cs.CV'] +Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning,Siteng Huang · Biao Gong · Yutong Feng · Zhang Min · Yiliang Lv · Donglin Wang, ,https://arxiv.org/abs/2311.14749,,2311.14749.pdf,Compositional Zero-shot Learning via Progressive Language-based Observations,"Compositional zero-shot learning aims to recognize unseen state-object +compositions by leveraging known primitives (state and object) during training. +However, effectively modeling interactions between primitives and generalizing +knowledge to novel compositions remains a perennial challenge. There are two +key factors: object-conditioned and state-conditioned variance, i.e., the +appearance of states (or objects) can vary significantly when combined with +different objects (or states). For instance, the state ""old"" can signify a +vintage design for a ""car"" or an advanced age for a ""cat"". 
In this paper, we +argue that these variances can be mitigated by predicting composition +categories based on pre-observed primitive. To this end, we propose Progressive +Language-based Observations (PLO), which can dynamically determine a better +observation order of primitives. These observations comprise a series of +concepts or languages that allow the model to understand image content in a +step-by-step manner. Specifically, PLO adopts pre-trained vision-language +models (VLMs) to empower the model with observation capabilities. We further +devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing +classifier dynamically determines the observation order of two primitives. 2) +PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to +craft composition-specific prompts for step-by-step observing. Extensive +ablations on three challenging datasets demonstrate the superiority of PLO +compared with state-of-the-art methods, affirming its abilities in +compositional recognition.",cs.CV,['cs.CV'] +EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,Sijie Cheng · Zhicheng Guo · Jingwen Wu · Kechen Fang · Peng Li · Huaping Liu · Yang Liu,https://adacheng.github.io/EgoThink/,https://arxiv.org/abs/2311.15596,,2311.15596.pdf,EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,"Vision-language models (VLMs) have recently shown promising results in +traditional downstream tasks. Evaluation studies have emerged to assess their +abilities, with the majority focusing on the third-person perspective, and only +a few addressing specific tasks from the first-person perspective. However, the +capability of VLMs to ""think"" from a first-person perspective, a crucial +attribute for advancing autonomous agents and robotics, remains largely +unexplored. To bridge this research gap, we introduce EgoThink, a novel visual +question-answering benchmark that encompasses six core capabilities with twelve +detailed dimensions. The benchmark is constructed using selected clips from +egocentric videos, with manually annotated question-answer pairs containing +first-person information. To comprehensively assess VLMs, we evaluate eighteen +popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, +we use GPT-4 as the automatic judge to compute single-answer grading. +Experimental results indicate that although GPT-4V leads in numerous +dimensions, all evaluated VLMs still possess considerable potential for +improvement in first-person perspective tasks. Meanwhile, enlarging the number +of trainable parameters has the most significant impact on model performance on +EgoThink. In conclusion, EgoThink serves as a valuable addition to existing +evaluation benchmarks for VLMs, providing an indispensable resource for future +research in the realm of embodied artificial intelligence and robotics.",cs.CV,"['cs.CV', 'cs.CL']" +Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields,Haoyuan Wang · Wenbo Hu · Lei Zhu · Rynson W.H. Lau,https://www.whyy.site/paper/nep,https://arxiv.org/abs/2403.16224,,2403.16224.pdf,Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields,"Inverse rendering aims at recovering both geometry and materials of objects. +It provides a more compatible reconstruction for conventional rendering +engines, compared with the neural radiance fields (NeRFs). 
On the other hand, +existing NeRF-based inverse rendering methods cannot handle glossy objects with +local light interactions well, as they typically oversimplify the illumination +as a 2D environmental map, which assumes infinite lights only. Observing the +superiority of NeRFs in recovering radiance fields, we propose a novel 5D +Neural Plenoptic Function (NeP) based on NeRFs and ray tracing, such that more +accurate lighting-object interactions can be formulated via the rendering +equation. We also design a material-aware cone sampling strategy to efficiently +integrate lights inside the BRDF lobes with the help of pre-filtered radiance +fields. Our method has two stages: the geometry of the target object and the +pre-filtered environmental radiance fields are reconstructed in the first +stage, and materials of the target object are estimated in the second stage +with the proposed NeP and material-aware cone sampling strategy. Extensive +experiments on the proposed real-world and synthetic datasets demonstrate that +our method can reconstruct high-fidelity geometry/materials of challenging +glossy objects with complex lighting interactions from nearby objects. Project +webpage: https://whyy.site/paper/nep",cs.CV,['cs.CV'] +Collaborating Foundation models for Domain Generalized Semantic Segmentation,Yasser Benigmim · Subhankar Roy · Slim Essid · Vicky Kalogeiton · Stéphane Lathuilière,https://yasserben.github.io/CLOUDS/,https://arxiv.org/abs/2312.09788,,2312.09788.pdf,Collaborating Foundation Models for Domain Generalized Semantic Segmentation,"Domain Generalized Semantic Segmentation (DGSS) deals with training a model +on a labeled source domain with the aim of generalizing to unseen domains +during inference. Existing DGSS methods typically effectuate robust features by +means of Domain Randomization (DR). Such an approach is often limited as it can +only account for style diversification and not content. In this work, we take +an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative +FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In +detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP +backbone for its robust feature representation, (ii) generative models to +diversify the content, thereby covering various modes of the possible target +distribution, and (iii) Segment Anything Model (SAM) for iteratively refining +the predictions of the segmentation model. Extensive experiments show that our +CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under +varying weather conditions, notably outperforming prior methods by 5.6% and +6.7% on averaged miou, respectively. 
The code is available at : +https://github.com/yasserben/CLOUDS",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification,Yuanmin Huang · Mi Zhang · Daizong Ding · Erling Jiang · Zhaoxiang Wang · Min Yang, ,,https://www.semanticscholar.org/paper/Deep-learning-for-large-scale-point-cloud-in-causal-Zhang-Ji/e1c76c0ba122201e813e3349dc0ebc8bde90eb34,,,,,nan +Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning,Menghao Zhang · Jingyu Wang · Qi Qi · Haifeng Sun · Zirui Zhuang · Pengfei Ren · Ruilong Ma · Jianxin Liao, ,https://arxiv.org/abs/2306.10239,,2306.10239.pdf,Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection,"Video Anomaly Detection (VAD) is an essential yet challenging task in signal +processing. Since certain anomalies cannot be detected by isolated analysis of +either temporal or spatial information, the interaction between these two types +of data is considered crucial for VAD. However, current dual-stream +architectures either confine this integral interaction to the bottleneck of the +autoencoder or introduce anomaly-irrelevant background pixels into the +interactive process, hindering the accuracy of VAD. To address these +deficiencies, we propose a Multi-scale Spatial-Temporal Interaction Network +(MSTI-Net) for VAD. First, to prioritize the detection of moving objects in the +scene and harmonize the substantial semantic discrepancies between the two +types of data, we propose an Attention-based Spatial-Temporal Fusion Module +(ASTFM) as a substitute for the conventional direct fusion. Furthermore, we +inject multi-ASTFM-based connections that bridge the appearance and motion +streams of the dual-stream network, thus fostering multi-scale spatial-temporal +interaction. Finally, to bolster the delineation between normal and abnormal +activities, our system records the regular information in a memory module. +Experimental results on three benchmark datasets validate the effectiveness of +our approach, which achieves AUCs of 96.8%, 87.6%, and 73.9% on the UCSD Ped2, +CUHK Avenue, and ShanghaiTech datasets, respectively.",cs.CV,['cs.CV'] +Data-Free Quantization via Pseudo-label Filtering,Chunxiao Fan · Ziqi Wang · Dan Guo · Meng Wang, ,http://export.arxiv.org/abs/2403.11256,,2403.11256.pdf,Uncertainty-Aware Pseudo-Label Filtering for Source-Free Unsupervised Domain Adaptation,"Source-free unsupervised domain adaptation (SFUDA) aims to enable the +utilization of a pre-trained source model in an unlabeled target domain without +access to source data. Self-training is a way to solve SFUDA, where confident +target samples are iteratively selected as pseudo-labeled samples to guide +target model learning. However, prior heuristic noisy pseudo-label filtering +methods all involve introducing extra models, which are sensitive to model +assumptions and may introduce additional errors or mislabeling. In this work, +we propose a method called Uncertainty-aware Pseudo-label-filtering Adaptation +(UPA) to efficiently address this issue in a coarse-to-fine manner. Specially, +we first introduce a sample selection module named Adaptive Pseudo-label +Selection (APS), which is responsible for filtering noisy pseudo labels. The +APS utilizes a simple sample uncertainty estimation method by aggregating +knowledge from neighboring samples and confident samples are selected as clean +pseudo-labeled. 
Additionally, we incorporate Class-Aware Contrastive Learning +(CACL) to mitigate the memorization of pseudo-label noise by learning robust +pair-wise representation supervised by pseudo labels. Through extensive +experiments conducted on three widely used benchmarks, we demonstrate that our +proposed method achieves competitive performance on par with state-of-the-art +SFUDA methods. Code is available at https://github.com/chenxi52/UPA.",cs.CV,['cs.CV'] +Adaptive Softassign via Hadamard-Equipped Sinkhorn,Binrui Shen · Qiang Niu · Shengxin Zhu, ,https://arxiv.org/abs/2309.13855,,2309.13855.pdf,Adaptive Softassign via Hadamard-Equipped Sinkhorn,"Softassign is a pivotal method in graph matching and other learning tasks. +Many softassign-based algorithms exhibit performance sensitivity to a parameter +in the softassign. However, tuning the parameter is challenging and almost done +empirically. This paper proposes an adaptive softassign method for graph +matching by analyzing the relationship between the objective score and the +parameter. This method can automatically tune the parameter based on a given +error bound to guarantee accuracy. The Hadamard-Equipped Sinkhorn formulas +introduced in this study significantly enhance the efficiency and stability of +the adaptive softassign. Moreover, these formulas can also be used in optimal +transport problems. The resulting adaptive softassign graph matching algorithm +enjoys significantly higher accuracy than previous state-of-the-art large graph +matching algorithms while maintaining comparable efficiency.",math.OC,"['math.OC', 'math.CO']" +SIGNeRF: Scene Integrated Generation for Neural Radiance Fields,Jan-Niklas Dihlmann · Andreas Engelhardt · Hendrik Lensch,https://signerf.jdihlmann.com/,https://arxiv.org/abs/2401.01647,,2401.01647.pdf,SIGNeRF: Scene Integrated Generation for Neural Radiance Fields,"Advances in image diffusion models have recently led to notable improvements +in the generation of high-quality images. In combination with Neural Radiance +Fields (NeRFs), they enabled new opportunities in 3D generation. However, most +generative 3D approaches are object-centric and applying them to editing +existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel +approach for fast and controllable NeRF scene editing and scene-integrated +object generation. A new generative update strategy ensures 3D consistency +across the edited images, without requiring iterative optimization. We find +that depth-conditioned diffusion models inherently possess the capability to +generate 3D consistent views by requesting a grid of images instead of single +views. Based on these insights, we introduce a multi-view reference sheet of +modified images. Our method updates an image collection consistently based on +the reference sheet and refines the original NeRF with the newly generated +image set in one go. By exploiting the depth conditioning mechanism of the +image diffusion model, we gain fine control over the spatial location of the +edit and enforce shape guidance by a selected region or an external mesh.",cs.CV,"['cs.CV', 'cs.GR']" +Putting the Object Back into Video Object Segmentation,Ho Kei Cheng · Seoung Wug Oh · Brian Price · Joon-Young Lee · Alexander G. 
Schwing,https://hkchengrex.com/Cutie/,https://arxiv.org/abs/2310.12982,,2310.12982.pdf,Putting the Object Back into Video Object Segmentation,"We present Cutie, a video object segmentation (VOS) network with object-level +memory reading, which puts the object representation from memory back into the +video object segmentation result. Recent works on VOS employ bottom-up +pixel-level memory reading which struggles due to matching noise, especially in +the presence of distractors, resulting in lower performance in more challenging +data. In contrast, Cutie performs top-down object-level memory reading by +adapting a small set of object queries. Via those, it interacts with the +bottom-up pixel features iteratively with a query-based object transformer (qt, +hence Cutie). The object queries act as a high-level summary of the target +object, while high-resolution feature maps are retained for accurate +segmentation. Together with foreground-background masked attention, Cutie +cleanly separates the semantics of the foreground object from the background. +On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a +similar running time and improves by 4.2 J&F over DeAOT while being three times +faster. Code is available at: https://hkchengrex.github.io/Cutie",cs.CV,['cs.CV'] +Generalized Predictive Model for Autonomous Driving,Jiazhi Yang · Shenyuan Gao · Yihang Qiu · Li Chen · Tianyu Li · Bo Dai · Kashyap Chitta · Penghao Wu · Jia Zeng · Ping Luo · Jun Zhang · Andreas Geiger · Yu Qiao · Hongyang Li,https://github.com/OpenDriveLab/DriveAGI,https://arxiv.org/abs/2403.09630,,2403.09630.pdf,Generalized Predictive Model for Autonomous Driving,"In this paper, we introduce the first large-scale video prediction model in +the autonomous driving discipline. To eliminate the restriction of high-cost +data collection and empower the generalization ability of our model, we acquire +massive data from the web and pair it with diverse and high-quality text +descriptions. The resultant dataset accumulates over 2000 hours of driving +videos, spanning areas all over the world with diverse weather conditions and +traffic scenarios. Inheriting the merits from recent latent diffusion models, +our model, dubbed GenAD, handles the challenging dynamics in driving scenes +with novel temporal reasoning blocks. We showcase that it can generalize to +various unseen driving datasets in a zero-shot manner, surpassing general or +driving-specific video prediction counterparts. Furthermore, GenAD can be +adapted into an action-conditioned prediction model or a motion planner, +holding great potential for real-world driving applications.",cs.CV,['cs.CV'] +BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition,Yuxuan Zhou · Xudong Yan · Zhi-Qi Cheng · Yan Yan · Qi Dai · Xian-Sheng Hua,https://github.com/ZhouYuxuanYX/BlockGCN,https://arxiv.org/html/2305.11468v3,,2305.11468v3.pdf,Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness,"Graph Convolutional Networks (GCNs) have long defined the state-of-the-art in +skeleton-based action recognition, leveraging their ability to unravel the +complex dynamics of human joint topology through the graph's adjacency matrix. +However, an inherent flaw has come to light in these cutting-edge models: they +tend to optimize the adjacency matrix jointly with the model weights. 
This +process, while seemingly efficient, causes a gradual decay of bone connectivity +data, culminating in a model indifferent to the very topology it sought to map. +As a remedy, we propose a threefold strategy: (1) We forge an innovative +pathway that encodes bone connectivity by harnessing the power of graph +distances. This approach preserves the vital topological nuances often lost in +conventional GCNs. (2) We highlight an oft-overlooked feature - the temporal +mean of a skeletal sequence, which, despite its modest guise, carries highly +action-specific information. (3) Our investigation revealed strong variations +in joint-to-joint relationships across different actions. This finding exposes +the limitations of a single adjacency matrix in capturing the variations of +relational configurations emblematic of human movement, which we remedy by +proposing an efficient refinement to Graph Convolutions (GC) - the BlockGC. +This evolution slashes parameters by a substantial margin (above 40%), while +elevating performance beyond original GCNs. Our full model, the BlockGCN, +establishes new standards in skeleton-based action recognition for small model +sizes. Its high accuracy, notably on the large-scale NTU RGB+D 120 dataset, +stand as compelling proof of the efficacy of BlockGCN.",cs.CV,['cs.CV'] +MotionEditor: Editing Video Motion via Content-Aware Diffusion,Shuyuan Tu · Qi Dai · Zhi-Qi Cheng · Han Hu · Xintong Han · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2311.18830,,2311.18830.pdf,MotionEditor: Editing Video Motion via Content-Aware Diffusion,"Existing diffusion-based video editing models have made gorgeous advances for +editing attributes of a source video over time but struggle to manipulate the +motion information while preserving the original protagonist's appearance and +background. To address this, we propose MotionEditor, a diffusion model for +video motion editing. MotionEditor incorporates a novel content-aware motion +adapter into ControlNet to capture temporal motion correspondence. While +ControlNet enables direct generation based on skeleton poses, it encounters +challenges when modifying the source motion in the inverted noise due to +contradictory signals between the noise (source) and the condition (reference). +Our adapter complements ControlNet by involving source content to transfer +adapted control signals seamlessly. Further, we build up a two-branch +architecture (a reconstruction branch and an editing branch) with a +high-fidelity attention injection mechanism facilitating branch interaction. +This mechanism enables the editing branch to query the key and value from the +reconstruction branch in a decoupled manner, making the editing branch retain +the original background and protagonist appearance. We also propose a skeleton +alignment algorithm to address the discrepancies in pose size and position. +Experiments demonstrate the promising motion editing ability of MotionEditor, +both qualitatively and quantitatively.",cs.CV,['cs.CV'] +ReconFusion: 3D Reconstruction with Diffusion Priors,Rundi Wu · Ben Mildenhall · Philipp Henzler · Ruiqi Gao · Keunhong Park · Daniel Watson · Pratul P. Srinivasan · Dor Verbin · Jonathan T. Barron · Ben Poole · Aleksander Holynski,https://reconfusion.github.io,https://arxiv.org/abs/2312.02981v1,,2312.02981v1.pdf,ReconFusion: 3D Reconstruction with Diffusion Priors,"3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at +rendering photorealistic novel views of complex scenes. 
However, recovering a +high-quality NeRF typically requires tens to hundreds of input images, +resulting in a time-consuming capture process. We present ReconFusion to +reconstruct real-world scenes using only a few photos. Our approach leverages a +diffusion prior for novel view synthesis, trained on synthetic and multiview +datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel +camera poses beyond those captured by the set of input images. Our method +synthesizes realistic geometry and texture in underconstrained regions while +preserving the appearance of observed regions. We perform an extensive +evaluation across various real-world datasets, including forward-facing and +360-degree scenes, demonstrating significant performance improvements over +previous few-view NeRF reconstruction approaches.",cs.CV,['cs.CV'] +Learning Vision from Models Rivals Learning Vision from Data,Yonglong Tian · Lijie Fan · Kaifeng Chen · Dina Katabi · Dilip Krishnan · Phillip Isola,https://github.com/google-research/syn-rep-learn/tree/main/SynCLR,https://arxiv.org/abs/2312.17742,,2312.17742.pdf,Learning Vision from Models Rivals Learning Vision from Data,"We introduce SynCLR, a novel approach for learning visual representations +exclusively from synthetic images and synthetic captions, without any real +data. We synthesize a large dataset of image captions using LLMs, then use an +off-the-shelf text-to-image model to generate multiple images corresponding to +each synthetic caption. We perform visual representation learning on these +synthetic images via contrastive learning, treating images sharing the same +caption as positive pairs. The resulting representations transfer well to many +downstream tasks, competing favorably with other general-purpose visual +representation learners such as CLIP and DINO v2 in image classification tasks. +Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR +outperforms previous self-supervised methods by a significant margin, e.g., +improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.",cs.CV,['cs.CV'] +"Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization",Mainak Singha · Ankit Jha · Shirsha Bose · Ashwin Nair · Moloud Abdar · Biplab Banerjee, ,https://arxiv.org/abs/2404.00710,,2404.00710.pdf,"Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization","We delve into Open Domain Generalization (ODG), marked by domain and category +shifts between training's labeled source and testing's unlabeled target +domains. Existing solutions to ODG face limitations due to constrained +generalizations of traditional CNN backbones and errors in detecting target +open samples in the absence of prior knowledge. Addressing these pitfalls, we +introduce ODG-CLIP, harnessing the semantic prowess of the vision-language +model, CLIP. Our framework brings forth three primary innovations: Firstly, +distinct from prevailing paradigms, we conceptualize ODG as a multi-class +classification challenge encompassing both known and novel categories. Central +to our approach is modeling a unique prompt tailored for detecting unknown +class samples, and to train this, we employ a readily accessible stable +diffusion model, elegantly generating proxy images for the open class. +Secondly, aiming for domain-tailored classification (prompt) weights while +ensuring a balance of precision and simplicity, we devise a novel visual +stylecentric prompt learning mechanism. 
Finally, we infuse images with +class-discriminative knowledge derived from the prompt space to augment the +fidelity of CLIP's visual embeddings. We introduce a novel objective to +safeguard the continuity of this infused semantic intel across domains, +especially for the shared classes. Through rigorous testing on diverse +datasets, covering closed and open-set DG contexts, ODG-CLIP demonstrates clear +supremacy, consistently outpacing peers with performance boosts between 8%-16%. +Code will be available at https://github.com/mainaksingha01/ODG-CLIP.",cs.CV,['cs.CV'] +Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds,Zhimin Yuan · Wankang Zeng · Yanfei Su · Weiquan Liu · Ming Cheng · Yulan Guo · Cheng Wang,https://github.com/yuan-zm/DGT-ST,https://arxiv.org/abs/2403.18469,,2403.18469.pdf,Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds,"3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to +annotating new domains. Self-training is a competitive approach for this task, +but its performance is limited by different sensor sampling patterns (i.e., +variations in point density) and incomplete training strategies. In this work, +we propose a density-guided translator (DGT), which translates point density +between domains, and integrates it into a two-stage self-training pipeline +named DGT-ST. First, in contrast to existing works that simultaneously conduct +data generation and feature/output alignment within unstable adversarial +training, we employ the non-learnable DGT to bridge the domain gap at the input +level. Second, to provide a well-initialized model for self-training, we +propose a category-level adversarial network in stage one that utilizes the +prototype to prevent negative transfer. Finally, by leveraging the designs +above, a domain-mixed self-training method with source-aware consistency loss +is proposed in stage two to narrow the domain gap further. Experiments on two +synthetic-to-real segmentation tasks (SynLiDAR $\rightarrow$ semanticKITTI and +SynLiDAR $\rightarrow$ semanticPOSS) demonstrate that DGT-ST outperforms +state-of-the-art methods, achieving 9.4$\%$ and 4.3$\%$ mIoU improvements, +respectively. Code is available at \url{https://github.com/yuan-zm/DGT-ST}.",cs.CV,"['cs.CV', 'cs.AI']" +Absolute Pose from One or Two Scaled and Oriented Features,Jonathan Ventura · Zuzana Kukelova · Torsten Sattler · Daniel Barath,https://github.com/danini/absolute-pose-from-oriented-and-scaled-features,https://arxiv.org/abs/2404.16552,,,Efficient Solution of Point-Line Absolute Pose,"We revisit certain problems of pose estimation based on 3D--2D +correspondences between features which may be points or lines. Specifically, we +address the two previously-studied minimal problems of estimating camera +extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ +line--line correspondences. To the best of our knowledge, all of the +previously-known practical solutions to these problems required computing the +roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ +polynomials when $p=1.$ We describe and implement two elementary solutions +which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ +to $4$, respectively. 
We show experimentally that the resulting solvers are +numerically stable and fast: when compared to the previous state-of-the art, we +may obtain nearly an order of magnitude speedup. The code is available at +\url{https://github.com/petrhruby97/efficient\_absolute}",cs.CV,"['cs.CV', '68T45', 'I.4.5']" +IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images,Yushuang Wu · Luyue Shi · Junhao Cai · Weihao Yuan · Lingteng Qiu · Zilong Dong · Liefeng Bo · Shuguang Cui · Xiaoguang Han,https://yushuang-wu.github.io/IPoD/,https://arxiv.org/abs/2404.00269,,2404.00269.pdf,IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images,"Generalizable 3D object reconstruction from single-view RGB-D images remains +a challenging task, particularly with real-world data. Current state-of-the-art +methods develop Transformer-based implicit field learning, necessitating an +intensive learning paradigm that requires dense query-supervision uniformly +sampled throughout the entire space. We propose a novel approach, IPoD, which +harmonizes implicit field learning with point diffusion. This approach treats +the query points for implicit field learning as a noisy point cloud for +iterative denoising, allowing for their dynamic adaptation to the target object +shape. Such adaptive query points harness diffusion learning's capability for +coarse shape recovery and also enhances the implicit representation's ability +to delineate finer details. Besides, an additional self-conditioning mechanism +is designed to use implicit predictions as the guidance of diffusion learning, +leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset +affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% +in Chamfer distance over existing methods. The generalizability of IPoD is also +demonstrated on the MVImgNet dataset. Our project page is at +https://yushuang-wu.github.io/IPoD.",cs.CV,['cs.CV'] +MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures,Zhangyang Xiong · Chenghong Li · Kenkun Liu · Hongjie Liao · Jianqiao HU · Junyi Zhu · Shuliang Ning · Lingteng Qiu · Chongjie Wang · Shijie Wang · Shuguang Cui · Xiaoguang Han, ,https://arxiv.org/abs/2312.02963,,2312.02963.pdf,MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures,"In this era, the success of large language models and text-to-image models +can be attributed to the driving force of large-scale datasets. However, in the +realm of 3D vision, while remarkable progress has been made with models trained +on large-scale synthetic and real-captured object data like Objaverse and +MVImgNet, a similar level of progress has not been observed in the domain of +human-centric tasks partially due to the lack of a large-scale human dataset. +Existing datasets of high-fidelity 3D human capture continue to be mid-sized +due to the significant challenges in acquiring large-scale high-quality 3D +human data. To bridge this gap, we present MVHumanNet, a dataset that comprises +multi-view human action sequences of 4,500 human identities. The primary focus +of our work is on collecting human data that features a large number of diverse +identities and everyday clothing using a multi-view human capture system, which +facilitates easily scalable data collection. 
Our dataset contains 9,000 daily +outfits, 60,000 motion sequences and 645 million frames with extensive +annotations, including human masks, camera parameters, 2D and 3D keypoints, +SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the +potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot +studies on view-consistent action recognition, human NeRF reconstruction, +text-driven view-unconstrained human image generation, as well as 2D +view-unconstrained human image and 3D avatar generation. Extensive experiments +demonstrate the performance improvements and effective applications enabled by +the scale provided by MVHumanNet. As the current largest-scale 3D human +dataset, we hope that the release of MVHumanNet data with annotations will +foster further innovations in the domain of 3D human-centric tasks at scale.",cs.CV,['cs.CV'] +SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing,Zeyinzi Jiang · Chaojie Mao · Yulin Pan · Zhen Han · Jingfeng Zhang,https://scedit.github.io/,https://arxiv.org/abs/2312.11392,,2312.11392.pdf,SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing,"Image diffusion models have been utilized in various tasks, such as +text-to-image generation and controllable image synthesis. Recent research has +introduced tuning methods that make subtle adjustments to the original models, +yielding promising results in specific adaptations of foundational generative +diffusion models. Rather than modifying the main backbone of the diffusion +model, we delve into the role of skip connection in U-Net and reveal that +hierarchical features aggregating long-distance information across encoder and +decoder make a significant impact on the content and quality of image +generation. Based on the observation, we propose an efficient generative tuning +framework, dubbed SCEdit, which integrates and edits Skip Connection using a +lightweight tuning module named SC-Tuner. Furthermore, the proposed framework +allows for straightforward extension to controllable image synthesis by +injecting different conditions with Controllable SC-Tuner, simplifying and +unifying the network design for multi-condition inputs. Our SCEdit +substantially reduces training parameters, memory usage, and computational +expense due to its lightweight tuners, with backward propagation only passing +to the decoder blocks. Extensive experiments conducted on text-to-image +generation and controllable image synthesis tasks demonstrate the superiority +of our method in terms of efficiency and performance. Project page: +\url{https://scedit.github.io/}",cs.CV,['cs.CV'] +Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis,Zanlin Ni · Yulin Wang · Renping Zhou · Jiayi Guo · Jinyi Hu · Zhiyuan Liu · Shiji Song · Yuan Yao · Gao Huang, ,https://arxiv.org/html/2312.14988v1,,2312.14988v1.pdf,Emage: Non-Autoregressive Text-to-Image Generation,"Autoregressive and diffusion models drive the recent breakthroughs on +text-to-image generation. Despite their huge success of generating +high-realistic images, a common shortcoming of these models is their high +inference latency - autoregressive models run more than a thousand times +successively to produce image tokens and diffusion models convert Gaussian +noise into images with many hundreds of denoising steps. In this work, we +explore non-autoregressive text-to-image models that efficiently generate +hundreds of image tokens in parallel. 
We develop many model variations with +different learning and inference strategies, initialized text encoders, etc. +Compared with autoregressive baselines that needs to run one thousand times, +our model only runs 16 times to generate images of competitive quality with an +order of magnitude lower inference latency. Our non-autoregressive model with +346M parameters generates an image of 256$\times$256 with about one second on +one V100 GPU.",cs.CV,['cs.CV'] +360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model,Qian Wang · Weiqi Li · Chong Mou · Xinhua Cheng · Jian Zhang, ,https://arxiv.org/abs/2401.06578,,2401.06578.pdf,360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model,"Panorama video recently attracts more interest in both study and application, +courtesy of its immersive experience. Due to the expensive cost of capturing +360-degree panoramic videos, generating desirable panorama videos by prompts is +urgently required. Lately, the emerging text-to-video (T2V) diffusion methods +demonstrate notable effectiveness in standard video generation. However, due to +the significant gap in content and motion patterns between panoramic and +standard videos, these methods encounter challenges in yielding satisfactory +360-degree panoramic videos. In this paper, we propose a pipeline named +360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic +videos based on the given prompts and motion conditions. Specifically, we +introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques +to transform pre-trained T2V models for panorama video generation. We further +propose a new panorama dataset named WEB360 consisting of panoramic video-text +pairs for training 360DVD, addressing the absence of captioned panoramic video +datasets. Extensive experiments demonstrate the superiority and effectiveness +of 360DVD for panorama video generation. Our project page is at +https://akaneqwq.github.io/360DVD/.",cs.CV,['cs.CV'] +All in One Framework for Multimodal Re-identification in the Wild,He Li · Mang Ye · Ming Zhang · Bo Du, ,https://arxiv.org/abs/2405.04741,,2405.04741.pdf,All in One Framework for Multimodal Re-identification in the Wild,"In Re-identification (ReID), recent advancements yield noteworthy progress in +both unimodal and cross-modal retrieval tasks. However, the challenge persists +in developing a unified framework that could effectively handle varying +multimodal data, including RGB, infrared, sketches, and textual information. +Additionally, the emergence of large-scale models shows promising performance +in various vision tasks but the foundation model in ReID is still blank. In +response to these challenges, a novel multimodal learning paradigm for ReID is +introduced, referred to as All-in-One (AIO), which harnesses a frozen +pre-trained big model as an encoder, enabling effective multimodal retrieval +without additional fine-tuning. The diverse multimodal data in AIO are +seamlessly tokenized into a unified space, allowing the modality-shared frozen +encoder to extract identity-consistent features comprehensively across all +modalities. Furthermore, a meticulously crafted ensemble of cross-modality +heads is designed to guide the learning trajectory. AIO is the \textbf{first} +framework to perform all-in-one ReID, encompassing four commonly used +modalities. 
Experiments on cross-modal and multimodal ReID reveal that AIO not +only adeptly handles various modal data but also excels in challenging +contexts, showcasing exceptional performance in zero-shot and domain +generalization scenarios.",cs.CV,['cs.CV'] +TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models,Zhongwei Zhang · Fuchen Long · Yingwei Pan · Zhaofan Qiu · Ting Yao · Yang Cao · Tao Mei,https://trip-i2v.github.io/TRIP/,https://arxiv.org/abs/2403.17005v1,,2403.17005v1.pdf,TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models,"Recent advances in text-to-video generation have demonstrated the utility of +powerful diffusion models. Nevertheless, the problem is not trivial when +shaping diffusion models to animate static image (i.e., image-to-video +generation). The difficulty originates from the aspect that the diffusion +process of subsequent animated frames should not only preserve the faithful +alignment with the given image but also pursue temporal coherence among +adjacent frames. To alleviate this, we present TRIP, a new recipe of +image-to-video diffusion paradigm that pivots on image noise prior derived from +static image to jointly trigger inter-frame relational reasoning and ease the +coherent temporal modeling via temporal residual learning. Technically, the +image noise prior is first attained through one-step backward diffusion process +based on both static image and noised video latent codes. Next, TRIP executes a +residual-like dual-path scheme for noise prediction: 1) a shortcut path that +directly takes image noise prior as the reference noise of each frame to +amplify the alignment between the first frame and subsequent frames; 2) a +residual path that employs 3D-UNet over noised video and static image latent +codes to enable inter-frame relational reasoning, thereby easing the learning +of the residual noise for each frame. Furthermore, both reference and residual +noise of each frame are dynamically merged via attention mechanism for final +video generation. Extensive experiments on WebVid-10M, DTDB and MSR-VTT +datasets demonstrate the effectiveness of our TRIP for image-to-video +generation. Please see our project page at https://trip-i2v.github.io/TRIP/.",cs.CV,"['cs.CV', 'cs.MM']" +Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities,Yiyuan Zhang · Xiaohan Ding · Kaixiong Gong · Yixiao Ge · Ying Shan · Xiangyu Yue, ,https://arxiv.org/abs/2401.14405,,2401.14405.pdf,Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities,"We propose to improve transformers of a specific modality with irrelevant +data from other modalities, e.g., improve an ImageNet model with audio or point +cloud datasets. We would like to highlight that the data samples of the target +modality are irrelevant to the other modalities, which distinguishes our method +from other works utilizing paired (e.g., CLIP) or interleaved data of different +modalities. We propose a methodology named Multimodal Pathway - given a target +modality and a transformer designed for it, we use an auxiliary transformer +trained with data of another modality and construct pathways to connect +components of the two models so that data of the target modality can be +processed by both models. In this way, we utilize the universal +sequence-to-sequence modeling abilities of transformers obtained from two +modalities. 
As a concrete implementation, we use a modality-specific tokenizer +and task-specific head as usual but utilize the transformer blocks of the +auxiliary model via a proposed method named Cross-Modal Re-parameterization, +which exploits the auxiliary weights without any inference costs. On the image, +point cloud, video, and audio recognition tasks, we observe significant and +consistent performance improvements with irrelevant data from other modalities. +The code and models are available at https://github.com/AILab-CVC/M2PT.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Geometry Transfer for Stylizing Radiance Fields,Hyunyoung Jung · Seonghyeon Nam · Nikolaos Sarafianos · Sungjoo Yoo · Alexander Sorkine-Hornung · Rakesh Ranjan,https://hyblue.github.io/geo-srf/,https://arxiv.org/abs/2402.00863,,2402.00863.pdf,Geometry Transfer for Stylizing Radiance Fields,"Shape and geometric patterns are essential in defining stylistic identity. +However, current 3D style transfer methods predominantly focus on transferring +colors and textures, often overlooking geometric aspects. In this paper, we +introduce Geometry Transfer, a novel method that leverages geometric +deformation for 3D style transfer. This technique employs depth maps to extract +a style guide, subsequently applied to stylize the geometry of radiance fields. +Moreover, we propose new techniques that utilize geometric cues from the 3D +scene, thereby enhancing aesthetic expressiveness and more accurately +reflecting intended styles. Our extensive experiments show that Geometry +Transfer enables a broader and more expressive range of stylizations, thereby +significantly expanding the scope of 3D style transfer.",cs.CV,['cs.CV'] +HRVDA: High-Resolution Visual Document Assistant,Chaohu Liu · Kun Yin · Haoyu Cao · Xinghua Jiang · Xin Li · Yinsong Liu · Deqiang Jiang · Xing Sun · Linli Xu, ,https://arxiv.org/abs/2404.06918,,2404.06918.pdf,HRVDA: High-Resolution Visual Document Assistant,"Leveraging vast training data, multimodal large language models (MLLMs) have +demonstrated formidable general visual comprehension capabilities and achieved +remarkable performance across various tasks. However, their performance in +visual document understanding still leaves much room for improvement. This +discrepancy is primarily attributed to the fact that visual document +understanding is a fine-grained prediction task. In natural scenes, MLLMs +typically use low-resolution images, leading to a substantial loss of visual +information. Furthermore, general-purpose MLLMs do not excel in handling +document-oriented instructions. In this paper, we propose a High-Resolution +Visual Document Assistant (HRVDA), which bridges the gap between MLLMs and +visual document understanding. This model employs a content filtering mechanism +and an instruction filtering module to separately filter out the +content-agnostic visual tokens and instruction-agnostic visual tokens, thereby +achieving efficient model training and inference for high-resolution images. In +addition, we construct a document-oriented visual instruction tuning dataset +and apply a multi-stage training strategy to enhance the model's document +modeling capabilities. 
Extensive experiments demonstrate that our model +achieves state-of-the-art performance across multiple document understanding +datasets, while maintaining training efficiency and inference speed comparable +to low-resolution models.",cs.CV,['cs.CV'] +TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes,Xuying Zhang · Bo-Wen Yin · yuming chen · Zheng Lin · Yunheng Li · Qibin Hou · Ming-Ming Cheng, ,https://arxiv.org/abs/2312.04248,,2312.04248.pdf,TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes,"Recent progress in the text-driven 3D stylization of a single object has been +considerably promoted by CLIP-based methods. However, the stylization of +multi-object 3D scenes is still impeded in that the image-text pairs used for +pre-training CLIP mostly consist of an object. Meanwhile, the local details of +multiple objects may be susceptible to omission due to the existing supervision +manner primarily relying on coarse-grained contrast of image-text pairs. To +overcome these challenges, we present a novel framework, dubbed TeMO, to parse +multi-object 3D scenes and edit their styles under the contrast supervision at +multiple levels. We first propose a Decoupled Graph Attention (DGA) module to +distinguishably reinforce the features of 3D surface points. Particularly, a +cross-modal graph is constructed to align the object points accurately and noun +phrases decoupled from the 3D mesh and textual description. Then, we develop a +Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss +between the words in the textual description and the randomly rendered images +are constructed to complement the coarse-grained loss. Extensive experiments +show that our method can synthesize high-quality stylized content and +outperform the existing methods over a wide range of multi-object 3D meshes. +Our code and results will be made publicly available",cs.CV,['cs.CV'] +Revisiting Single Image Reflection Removal In the Wild,Yurui Zhu · Bo Li · Xueyang Fu · Peng-Tao Jiang · Hao Zhang · Qibin Sun · Zheng-Jun Zha · Jinwei Chen, ,https://arxiv.org/abs/2311.17320,,2311.17320.pdf,Revisiting Single Image Reflection Removal In the Wild,"This research focuses on the issue of single-image reflection removal (SIRR) +in real-world conditions, examining it from two angles: the collection pipeline +of real reflection pairs and the perception of real reflection locations. We +devise an advanced reflection collection pipeline that is highly adaptable to a +wide range of real-world reflection scenarios and incurs reduced costs in +collecting large-scale aligned reflection pairs. In the process, we develop a +large-scale, high-quality reflection dataset named Reflection Removal in the +Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection +pairs, a dataset forty-five times larger than its predecessors. Regarding +perception of reflection locations, we identify that numerous virtual +reflection objects visible in reflection images are not present in the +corresponding ground-truth images. This observation, drawn from the aligned +pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF +could accurately and explicitly characterize reflection locations from pairs of +images. Building upon this, we design a reflection location-aware cascaded +framework, specifically tailored for SIRR. Powered by these innovative +techniques, our solution achieves superior performance than current leading +methods across multiple real-world benchmarks. 
Codes and datasets will be +publicly available.",cs.CV,['cs.CV'] +Inlier Confidence Calibration for Point Cloud Registration,Yongzhe Yuan · Yue Wu · Xiaolong Fan · Maoguo Gong · Qiguang Miao · Wenping Ma, ,https://arxiv.org/abs/2307.14019,,2307.14019.pdf,One-Nearest Neighborhood Guides Inlier Estimation for Unsupervised Point Cloud Registration,"The precision of unsupervised point cloud registration methods is typically +limited by the lack of reliable inlier estimation and self-supervised signal, +especially in partially overlapping scenarios. In this paper, we propose an +effective inlier estimation method for unsupervised point cloud registration by +capturing geometric structure consistency between the source point cloud and +its corresponding reference point cloud copy. Specifically, to obtain a high +quality reference point cloud copy, an One-Nearest Neighborhood (1-NN) point +cloud is generated by input point cloud. This facilitates matching map +construction and allows for integrating dual neighborhood matching scores of +1-NN point cloud and input point cloud to improve matching confidence. +Benefiting from the high quality reference copy, we argue that the neighborhood +graph formed by inlier and its neighborhood should have consistency between +source point cloud and its corresponding reference copy. Based on this +observation, we construct transformation-invariant geometric structure +representations and capture geometric structure consistency to score the inlier +confidence for estimated correspondences between source point cloud and its +reference copy. This strategy can simultaneously provide the reliable +self-supervised signal for model optimization. Finally, we further calculate +transformation estimation by the weighted SVD algorithm with the estimated +correspondences and corresponding inlier confidence. We train the proposed +model in an unsupervised manner, and extensive experiments on synthetic and +real-world datasets illustrate the effectiveness of the proposed method.",cs.CV,"['cs.CV', 'cs.AI']" +Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation,Yeonguk Yu · Sungho Shin · Seunghyeok Back · Minhwan Ko · Sangjun Noh · Kyoobin Lee,https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA,https://arxiv.org/abs/2404.10966v2,,2404.10966v2.pdf,Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation,"Test-time adaptation (TTA) aims to adapt a pre-trained model to a new test +domain without access to source data after deployment. Existing approaches +typically rely on self-training with pseudo-labels since ground-truth cannot be +obtained from test data. Although the quality of pseudo labels is important for +stable and accurate long-term adaptation, it has not been previously addressed. +In this work, we propose DPLOT, a simple yet effective TTA framework that +consists of two components: (1) domain-specific block selection and (2) +pseudo-label generation using paired-view images. Specifically, we select +blocks that involve domain-specific feature extraction and train these blocks +by entropy minimization. After blocks are adjusted for current test domain, we +generate pseudo-labels by averaging given test images and corresponding flipped +counterparts. By simply using flip augmentation, we prevent a decrease in the +quality of the pseudo-labels, which can be caused by the domain gap resulting +from strong augmentation. 
Our experimental results demonstrate that DPLOT +outperforms previous TTA methods in CIFAR10-C, CIFAR100-C, and ImageNet-C +benchmarks, reducing error by up to 5.4%, 9.1%, and 2.9%, respectively. Also, +we provide an extensive analysis to demonstrate effectiveness of our framework. +Code is available at +https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA.",cs.CV,['cs.CV'] +BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image,Minje Kim · Tae-Kyun Kim,https://yunminjin2.github.io/projects/bitt/,https://arxiv.org/abs/2403.08262,,2403.08262.pdf,BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image,"Creating personalized hand avatars is important to offer a realistic +experience to users on AR / VR platforms. While most prior studies focused on +reconstructing 3D hand shapes, some recent work has tackled the reconstruction +of hand textures on top of shapes. However, these methods are often limited to +capturing pixels on the visible side of a hand, requiring diverse views of the +hand in a video or multiple images as input. In this paper, we propose a novel +method, BiTT(Bi-directional Texture reconstruction of Two hands), which is the +first end-to-end trainable method for relightable, pose-free texture +reconstruction of two interacting hands taking only a single RGB image, by +three novel components: 1) bi-directional (left $\leftrightarrow$ right) +texture reconstruction using the texture symmetry of left / right hands, 2) +utilizing a texture parametric model for hand texture recovery, and 3) the +overall coarse-to-fine stage pipeline for reconstructing personalized texture +of two interacting hands. BiTT first estimates the scene light condition and +albedo image from an input image, then reconstructs the texture of both hands +through the texture parametric model and bi-directional texture reconstructor. +In experiments using InterHand2.6M and RGB2Hands datasets, our method +significantly outperforms state-of-the-art hand texture reconstruction methods +quantitatively and qualitatively. The code is available at +https://github.com/yunminjin2/BiTT",cs.CV,['cs.CV'] +Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes,Chi-Hsi Kung · 書緯 呂 · Yi-Hsuan Tsai · Yi-Ting Chen, ,https://arxiv.org/abs/2311.17948,,2311.17948.pdf,Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes,"In this paper, we study multi-label atomic activity recognition. Despite the +notable progress in action recognition, it is still challenging to recognize +atomic activities due to a deficiency in a holistic understanding of both +multiple road users' motions and their contextual information. In this paper, +we introduce Action-slot, a slot attention-based approach that learns visual +action-centric representations, capturing both motion and contextual +information. Our key idea is to design action slots that are capable of paying +attention to regions where atomic activities occur, without the need for +explicit perception guidance. To further enhance slot attention, we introduce a +background slot that competes with action slots, aiding the training process in +avoiding unnecessary focus on background regions devoid of activities. Yet, the +imbalanced class distribution in the existing dataset hampers the assessment of +rare activities. 
To address the limitation, we collect a synthetic dataset +called TACO, which is four times larger than OATS and features a balanced +distribution of atomic activities. To validate the effectiveness of our method, +we conduct comprehensive experiments and ablation studies against various +action recognition baselines. We also show that the performance of multi-label +atomic activity recognition on real-world datasets can be improved by +pretraining representations on TACO. We will release our source code and +dataset. See the videos of visualization on the project page: +https://hcis-lab.github.io/Action-slot/",cs.CV,"['cs.CV', 'cs.LG']" +"Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models",Luo Jiayun · Siddhesh Khandelwal · Leonid Sigal · Boyang Li, ,https://arxiv.org/abs/2311.17095v1,,2311.17095v1.pdf,"Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models","From an enormous amount of image-text pairs, large-scale vision-language +models (VLMs) learn to implicitly associate image regions with words, which is +vital for tasks such as image captioning and visual question answering. +However, leveraging such pre-trained models for open-vocabulary semantic +segmentation remains a challenge. In this paper, we propose a simple, yet +extremely effective, training-free technique, Plug-and-Play Open-Vocabulary +Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with +direct text-to-image cross-attention and an image-text matching loss to produce +semantic segmentation. However, cross-attention alone tends to over-segment, +whereas cross-attention plus GradCAM tend to under-segment. To alleviate this +issue, we introduce Salience Dropout; by iteratively dropping patches that the +model is most attentive to, we are able to better resolve the entire extent of +the segmentation mask. Compared to existing techniques, the proposed method +does not require any neural network training and performs hyperparameter tuning +without the need for any segmentation annotations, even for a validation set. +PnP-OVSS demonstrates substantial improvements over a comparable baseline +(+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS +COCO, +2.4% mIoU on COCO Stuff) and even outperforms most baselines that +conduct additional network training on top of pretrained VLMs.",cs.CV,"['cs.CV', 'cs.AI']" +Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange,Yanhao Wu · Tong Zhang · Wei Ke · Congpei Qiu · Sabine Süsstrunk · Mathieu Salzmann, ,,https://www.semanticscholar.org/paper/Mitigating-Object-Dependencies:-Improving-Point-Wu-Zhang/1cafd8d79a0e2242cb1f8a2ce26db175785ebf88,,,,,nan +MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes,Bor Shiun Wang · Chien-Yi Wang · Wei-Chen Chiu,https://eddie221.github.io/MCPNet/,https://arxiv.org/abs/2404.08968,,2404.08968.pdf,MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes,"Recent advancements in post-hoc and inherently interpretable methods have +markedly enhanced the explanations of black box classifier models. These +methods operate either through post-analysis or by integrating concept learning +during model training. Although being effective in bridging the semantic gap +between a model's latent space and human interpretation, these explanation +methods only partially reveal the model's decision-making process. 
The outcome +is typically limited to high-level semantics derived from the last feature map. +We argue that the explanations lacking insights into the decision processes at +low and mid-level features are neither fully faithful nor useful. Addressing +this gap, we introduce the Multi-Level Concept Prototypes Classifier (MCPNet), +an inherently interpretable model. MCPNet autonomously learns meaningful +concept prototypes across multiple feature map levels using Centered Kernel +Alignment (CKA) loss and an energy-based weighted PCA mechanism, and it does so +without reliance on predefined concept labels. Further, we propose a novel +classifier paradigm that learns and aligns multi-level concept prototype +distributions for classification purposes via Class-aware Concept Distribution +(CCD) loss. Our experiments reveal that our proposed MCPNet while being +adaptable to various model architectures, offers comprehensive multi-level +explanations while maintaining classification accuracy. Additionally, its +concept distribution-based classification approach shows improved +generalization capabilities in few-shot classification scenarios.",cs.CV,"['cs.CV', 'cs.LG']" +OmniMotionGPT: Animal Motion Generation with Limited Data,Zhangsihao Yang · Mingyuan Zhou · Mengyi Shan · Bingbing Wen · Ziwei Xuan · Mitch Hill · Junjie Bai · Guo-Jun Qi · Yalin Wang, ,https://arxiv.org/abs/2311.18303,,2311.18303.pdf,OmniMotionGPT: Animal Motion Generation with Limited Data,"Our paper aims to generate diverse and realistic animal motion sequences from +textual descriptions, without a large-scale animal text-motion dataset. While +the task of text-driven human motion synthesis is already extensively studied +and benchmarked, it remains challenging to transfer this success to other +skeleton structures with limited data. In this work, we design a model +architecture that imitates Generative Pretraining Transformer (GPT), utilizing +prior knowledge learned from human data to the animal domain. We jointly train +motion autoencoders for both animal and human motions and at the same time +optimize through the similarity scores among human motion encoding, animal +motion encoding, and text CLIP embedding. Presenting the first solution to this +problem, we are able to generate animal motions with high diversity and +fidelity, quantitatively and qualitatively outperforming the results of +training human motion generation baselines on animal data. Additionally, we +introduce AnimalML3D, the first text-animal motion dataset with 1240 animation +sequences spanning 36 different animal identities. We hope this dataset would +mediate the data scarcity problem in text-driven animal motion generation, +providing a new playground for the research community.",cs.CV,['cs.CV'] +Noisy-Correspondence Learning for Text-to-Image Person Re-identification,Yang Qin · Yingke Chen · Dezhong Peng · Xi Peng · Joey Tianyi Zhou · Peng Hu,https://github.com/QinYang79/RDE,https://arxiv.org/abs/2308.09911,,2308.09911.pdf,Noisy-Correspondence Learning for Text-to-Image Person Re-identification,"Text-to-image person re-identification (TIReID) is a compelling topic in the +cross-modal community, which aims to retrieve the target person based on a +textual query. Although numerous TIReID methods have been proposed and achieved +promising performance, they implicitly assume the training image-text pairs are +correctly aligned, which is not always the case in real-world scenarios. 
In +practice, the image-text pairs inevitably exist under-correlated or even +false-correlated, a.k.a noisy correspondence (NC), due to the low quality of +the images and annotation errors. To address this problem, we propose a novel +Robust Dual Embedding method (RDE) that can learn robust visual-semantic +associations even with NC. Specifically, RDE consists of two main components: +1) A Confident Consensus Division (CCD) module that leverages the dual-grained +decisions of dual embedding modules to obtain a consensus set of clean training +data, which enables the model to learn correct and reliable visual-semantic +associations. 2) A Triplet Alignment Loss (TAL) relaxes the conventional +Triplet Ranking loss with the hardest negative samples to a log-exponential +upper bound over all negative ones, thus preventing the model collapse under NC +and can also focus on hard-negative samples for promising performance. We +conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, +ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our +RDE. Our method achieves state-of-the-art results both with and without +synthetic noisy correspondences on all three datasets. Code is available at +https://github.com/QinYang79/RDE.",cs.CV,"['cs.CV', 'cs.MM']" +SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer,Rui Zhu · Yingwei Pan · Yehao Li · Ting Yao · Zhenglong Sun · Tao Mei · Chang-Wen Chen, ,https://arxiv.org/abs/2403.17004,,2403.17004.pdf,SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer,"Diffusion Transformer (DiT) has emerged as the new trend of generative +diffusion models on image generation. In view of extremely slow convergence in +typical DiT, recent breakthroughs have been driven by mask strategy that +significantly improves the training efficiency of DiT with additional +intra-image contextual learning. Despite this progress, mask strategy still +suffers from two inherent limitations: (a) training-inference discrepancy and +(b) fuzzy relations between mask reconstruction & generative diffusion process, +resulting in sub-optimal training of DiT. In this work, we address these +limitations by novelly unleashing the self-supervised discrimination knowledge +to boost DiT training. Technically, we frame our DiT in a teacher-student +manner. The teacher-student discriminative pairs are built on the diffusion +noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). +Instead of applying mask reconstruction loss over both DiT encoder and decoder, +we decouple DiT encoder and decoder to separately tackle discriminative and +generative objectives. In particular, by encoding discriminative pairs with +student and teacher DiT encoders, a new discriminative loss is designed to +encourage the inter-image alignment in the self-supervised embedding space. +After that, student samples are fed into student DiT decoder to perform the +typical generative diffusion task. 
Extensive experiments are conducted on +ImageNet dataset, and our method achieves a competitive balance between +training cost and generative capacity.",cs.CV,"['cs.CV', 'cs.MM']" +CoDeF: Content Deformation Fields for Temporally Consistent Video Processing,Hao Ouyang · Qiuyu Wang · Yuxi Xiao · Qingyan Bai · Juntao Zhang · Kecheng Zheng · Xiaowei Zhou · Qifeng Chen · Yujun Shen,https://qiuyu96.github.io/CoDeF/,https://arxiv.org/abs/2308.07926,,2308.07926.pdf,CoDeF: Content Deformation Fields for Temporally Consistent Video Processing,"We present the content deformation field CoDeF as a new type of video +representation, which consists of a canonical content field aggregating the +static contents in the entire video and a temporal deformation field recording +the transformations from the canonical image (i.e., rendered from the canonical +content field) to each individual frame along the time axis.Given a target +video, these two fields are jointly optimized to reconstruct it through a +carefully tailored rendering pipeline.We advisedly introduce some +regularizations into the optimization process, urging the canonical content +field to inherit semantics (e.g., the object shape) from the video.With such a +design, CoDeF naturally supports lifting image algorithms for video processing, +in the sense that one can apply an image algorithm to the canonical image and +effortlessly propagate the outcomes to the entire video with the aid of the +temporal deformation field.We experimentally show that CoDeF is able to lift +image-to-image translation to video-to-video translation and lift keypoint +detection to keypoint tracking without any training.More importantly, thanks to +our lifting strategy that deploys the algorithms on only one image, we achieve +superior cross-frame consistency in processed videos compared to existing +video-to-video translation approaches, and even manage to track non-rigid +objects like water and smog.Project page can be found at +https://qiuyu96.github.io/CoDeF/.",cs.CV,['cs.CV'] +Action Detection via an Image Diffusion Process,Lin Geng Foo · Tianjiao Li · Hossein Rahmani · Jun Liu, ,https://arxiv.org/abs/2404.01051,,2404.01051.pdf,Action Detection via an Image Diffusion Process,"Action detection aims to localize the starting and ending points of action +instances in untrimmed videos, and predict the classes of those instances. In +this paper, we make the observation that the outputs of the action detection +task can be formulated as images. Thus, from a novel perspective, we tackle +action detection via a three-image generation process to generate starting +point, ending point and action-class predictions as images via our proposed +Action Detection Image Diffusion (ADI-Diff) framework. Furthermore, since our +images differ from natural images and exhibit special properties, we further +explore a Discrete Action-Detection Diffusion Process and a Row-Column +Transformer design to better handle their processing. 
Our ADI-Diff framework +achieves state-of-the-art results on two widely-used datasets.",cs.CV,['cs.CV'] +T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory,Daehee Park · Jaeseok Jeong · Sung-Hoon Yoon · Jaewoo Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2403.10052,,2403.10052.pdf,T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory,"Trajectory prediction is a challenging problem that requires considering +interactions among multiple actors and the surrounding environment. While +data-driven approaches have been used to address this complex problem, they +suffer from unreliable predictions under distribution shifts during test time. +Accordingly, several online learning methods have been proposed using +regression loss from the ground truth of observed data leveraging the +auto-labeling nature of trajectory prediction task. We mainly tackle the +following two issues. First, previous works underfit and overfit as they only +optimize the last layer of the motion decoder. To this end, we employ the +masked autoencoder (MAE) for representation learning to encourage complex +interaction modeling in shifted test distribution for updating deeper layers. +Second, utilizing the sequential nature of driving data, we propose an +actor-specific token memory that enables the test-time learning of actor-wise +motion characteristics. Our proposed method has been validated across various +challenging cross-dataset distribution shift scenarios including nuScenes, +Lyft, Waymo, and Interaction. Our method surpasses the performance of existing +state-of-the-art online learning methods in terms of both prediction accuracy +and computational efficiency. The code is available at +https://github.com/daeheepark/T4P.",cs.CV,['cs.CV'] +Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation,Song Wang · Jiawei Yu · Wentong Li · Wenyu Liu · Xiaolu Liu · Junbo Chen · Jianke Zhu,https://github.com/songw-zju/HASSC,https://arxiv.org/abs/2404.11958,,2404.11958.pdf,Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation,"Semantic scene completion, also known as semantic occupancy prediction, can +provide dense geometric and semantic information for autonomous vehicles, which +attracts the increasing attention of both academia and industry. Unfortunately, +existing methods usually formulate this task as a voxel-wise classification +problem and treat each voxel equally in 3D space during training. As the hard +voxels have not been paid enough attention, the performance in some challenging +regions is limited. The 3D dense space typically contains a large number of +empty voxels, which are easy to learn but require amounts of computation due to +handling all the voxels uniformly for the existing models. Furthermore, the +voxels in the boundary region are more challenging to differentiate than those +in the interior. In this paper, we propose HASSC approach to train the semantic +scene completion model with hardness-aware design. The global hardness from the +network optimization process is defined for dynamical hard voxel selection. +Then, the local hardness with geometric anisotropy is adopted for voxel-wise +refinement. Besides, self-distillation strategy is introduced to make training +process stable and consistent. 
Extensive experiments show that our HASSC scheme +can effectively promote the accuracy of the baseline model without incurring +the extra inference cost. Source code is available at: +https://github.com/songw-zju/HASSC.",cs.CV,"['cs.CV', 'cs.RO']" +CogAgent: A Visual Language Model for GUI Agents,Wenyi Hong · Weihan Wang · Qingsong Lv · Jiazheng Xu · Wenmeng Yu · Junhui Ji · Yan Wang · Zihan Wang · Yuxiao Dong · Ming Ding · Jie Tang, ,https://arxiv.org/abs/2312.08914,,2312.08914.pdf,CogAgent: A Visual Language Model for GUI Agents,"People are spending an enormous amount of time on digital devices through +graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large +language models (LLMs) such as ChatGPT can assist people in tasks like writing +emails, but struggle to understand and interact with GUIs, thus limiting their +potential to increase automation levels. In this paper, we introduce CogAgent, +an 18-billion-parameter visual language model (VLM) specializing in GUI +understanding and navigation. By utilizing both low-resolution and +high-resolution image encoders, CogAgent supports input at a resolution of +1120*1120, enabling it to recognize tiny page elements and text. As a +generalist visual language model, CogAgent achieves the state of the art on +five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, +Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using +only screenshots as input, outperforms LLM-based methods that consume extracted +HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, +advancing the state of the art. The model and codes are available at +https://github.com/THUDM/CogVLM .",cs.CV,['cs.CV'] +Representing Signs as Language: A New Method for Sign Language Translation from Videos,Jia Gong · Lin Geng Foo · Yixuan He · Hossein Rahmani · Jun Liu, ,https://arxiv.org/abs/2404.00925,,2404.00925.pdf,LLMs are Good Sign Language Translators,"Sign Language Translation (SLT) is a challenging task that aims to translate +sign videos into spoken language. Inspired by the strong translation +capabilities of large language models (LLMs) that are trained on extensive +multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. +In this paper, we regularize the sign videos to embody linguistic +characteristics of spoken language, and propose a novel SignLLM framework to +transform sign videos into a language-like representation for improved +readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The +Vector-Quantized Visual Sign module converts sign videos into a sequence of +discrete character-level sign tokens, and (2) the Codebook Reconstruction and +Alignment module converts these character-level tokens into word-level sign +representations using an optimal transport formulation. A sign-text alignment +loss further bridges the gap between sign and text tokens, enhancing semantic +compatibility. We achieve state-of-the-art gloss-free results on two +widely-used SLT benchmarks.",cs.CV,"['cs.CV', 'cs.CL']" +Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution,Zhikai Chen · Fuchen Long · Zhaofan Qiu · Ting Yao · Wengang Zhou · Jiebo Luo · Tao Mei, ,https://arxiv.org/abs/2403.17000,,2403.17000.pdf,Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution,"Diffusion models are just at a tipping point for image super-resolution task. 
+Nevertheless, it is not trivial to capitalize on diffusion models for video +super-resolution which necessitates not only the preservation of visual +appearance from low-resolution to high-resolution videos, but also the temporal +consistency across video frames. In this paper, we propose a novel approach, +pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video +super-resolution. SATeCo pivots on learning spatial-temporal guidance from +low-resolution videos to calibrate both latent-space high-resolution video +denoising and pixel-space video reconstruction. Technically, SATeCo freezes all +the parameters of the pre-trained UNet and VAE, and only optimizes two +deliberately-designed spatial feature adaptation (SFA) and temporal feature +alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame +features via adaptively estimating affine parameters for each pixel, +guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA +delves into feature interaction within a 3D local window (tubelet) through +self-attention, and executes cross-attention between tubelet and its +low-resolution counterpart to guide temporal feature alignment. Extensive +experiments conducted on the REDS4 and Vid4 datasets demonstrate the +effectiveness of our approach.",cs.CV,"['cs.CV', 'cs.MM']" +Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving,JINLONG LI · Baolu Li · Zhengzhong Tu · XINYU LIU · Qing Guo · Felix Juefei Xu · Runsheng Xu · Hongkai Yu, ,https://arxiv.org/abs/2404.04804,,2404.04804.pdf,Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving,"Vision-centric perception systems for autonomous driving have gained +considerable attention recently due to their cost-effectiveness and +scalability, especially compared to LiDAR-based systems. However, these systems +often struggle in low-light conditions, potentially compromising their +performance and safety. To address this, our paper introduces LightDiff, a +domain-tailored framework designed to enhance the low-light image quality for +autonomous driving applications. Specifically, we employ a multi-condition +controlled diffusion model. LightDiff works without any human-collected paired +data, leveraging a dynamic data degradation process instead. It incorporates a +novel multi-condition adapter that adaptively controls the input weights from +different modalities, including depth maps, RGB images, and text captions, to +effectively illuminate dark scenes while maintaining context consistency. +Furthermore, to align the enhanced images with the detection model's knowledge, +LightDiff employs perception-specific scores as rewards to guide the diffusion +training process through reinforcement learning. 
Extensive experiments on the +nuScenes datasets demonstrate that LightDiff can significantly improve the +performance of several state-of-the-art 3D detectors in night-time conditions +while achieving high visual quality scores, highlighting its potential to +safeguard autonomous driving.",cs.CV,['cs.CV'] +Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration,Chen Zhao · Weiling Cai · Chenyu Dong · Chengwei Hu,https://github.com/zhihefang,https://arxiv.org/abs/2311.16845,,2311.16845.pdf,Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration,"Underwater images are subject to intricate and diverse degradation, +inevitably affecting the effectiveness of underwater visual tasks. However, +most approaches primarily operate in the raw pixel space of images, which +limits the exploration of the frequency characteristics of underwater images, +leading to an inadequate utilization of deep models' representational +capabilities in producing high-quality images. In this paper, we introduce a +novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to +fully leverage the characteristics of frequency domain information and +diffusion models. WF-Diff consists of two detachable networks: Wavelet-based +Fourier information interaction network (WFI2-net) and Frequency Residual +Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency +domain information, WFI2-net aims to achieve preliminary enhancement of +frequency information in the wavelet space. Our proposed FRDAM can further +refine the high- and low-frequency information of the initial enhanced images, +which can be viewed as a plug-and-play universal module to adjust the detail of +the underwater images. With the above techniques, our algorithm can show SOTA +performance on real-world underwater image datasets, and achieves competitive +performance in visual quality.",cs.CV,['cs.CV'] +Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning,Jaewoo Jeong · Daehee Park · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.05218v1,,2404.05218v1.pdf,Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning,"Human pose forecasting garners attention for its diverse applications. +However, challenges in modeling the multi-modal nature of human motion and +intricate interactions among agents persist, particularly with longer +timescales and more agents. In this paper, we propose an interaction-aware +trajectory-conditioned long-term multi-agent human pose forecasting model, +utilizing a coarse-to-fine prediction approach: multi-modal global trajectories +are initially forecasted, followed by respective local pose forecasts +conditioned on each mode. In doing so, our Trajectory2Pose model introduces a +graph-based agent-wise interaction module for a reciprocal forecast of local +motion-conditioned global trajectory and trajectory-conditioned local pose. Our +model effectively handles the multi-modality of human motion and the complexity +of long-term multi-agent interactions, improving performance in complex +environments. Furthermore, we address the lack of long-term (6s+) multi-agent +(5+) datasets by constructing a new dataset from real-world images and 2D +annotations, enabling a comprehensive evaluation of our proposed model. 
+State-of-the-art prediction performance on both complex and simpler datasets +confirms the generalized effectiveness of our method. The code is available at +https://github.com/Jaewoo97/T2P.",cs.CV,"['cs.CV', 'cs.AI']" +Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains,Eunsu Baek · Keondo Park · Ji-yoon Kim · Hyung-Sin Kim,https://github.com/Edw2n/ImageNet-ES,https://arxiv.org/abs/2404.15882,,2404.15882.pdf,Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains,"Computer vision applications predict on digital images acquired by a camera +from physical scenes through light. However, conventional robustness benchmarks +rely on perturbations in digitized images, diverging from distribution shifts +occurring in the image acquisition process. To bridge this gap, we introduce a +new distribution shift dataset, ImageNet-ES, comprising variations in +environmental and camera sensor factors by directly capturing 202k images with +a real camera in a controllable testbed. With the new dataset, we evaluate +out-of-distribution (OOD) detection and model robustness. We find that existing +OOD detection methods do not cope with the covariate shifts in ImageNet-ES, +implying that the definition and detection of OOD should be revisited to +embrace real-world distribution shifts. We also observe that the model becomes +more robust in both ImageNet-C and -ES by learning environment and sensor +variations in addition to existing digital augmentations. Lastly, our results +suggest that effective shift mitigation via camera sensor control can +significantly improve performance without increasing model size. With these +findings, our benchmark may aid future research on robustness, OOD, and camera +sensor control for computer vision. Our code and dataset are available at +https://github.com/Edw2n/ImageNet-ES.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,Xiaoyang Wu · Zhuotao Tian · Xin Wen · Bohao Peng · Xihui Liu · Kaicheng Yu · Hengshuang Zhao,https://github.com/Pointcept/Pointcept,https://arxiv.org/abs/2308.09718,,2308.09718.pdf,Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,"The rapid advancement of deep learning models often attributes to their +ability to leverage massive training data. In contrast, such privilege has not +yet fully benefited 3D deep learning, mainly due to the limited availability of +large-scale 3D datasets. Merging multiple available data sources and letting +them collaboratively train a single model is a potential solution. However, due +to the large domain gap between 3D point cloud datasets, such mixed supervision +could adversely affect the model's performance and lead to degenerated +performance (i.e., negative transfer) compared to single-dataset training. In +view of this challenge, we introduce Point Prompt Training (PPT), a novel +framework for multi-dataset synergistic learning in the context of 3D +representation learning that supports multiple pre-training paradigms. Based on +this framework, we propose Prompt-driven Normalization, which adapts the model +to different datasets with domain-specific prompts and Language-guided +Categorical Alignment that decently unifies the multiple-dataset label spaces +by leveraging the relationship between label text. 
Extensive experiments verify +that PPT can overcome the negative transfer associated with synergistic +learning and produce generalizable representations. Notably, it achieves +state-of-the-art performance on each dataset using a single weight-shared model +with supervised multi-dataset training. Moreover, when served as a pre-training +framework, it outperforms other pre-training approaches regarding +representation quality and attains remarkable state-of-the-art performance +across over ten diverse downstream tasks spanning both indoor and outdoor 3D +scenarios.",cs.CV,['cs.CV'] +How to Train Neural Field Representations: A Comprehensive Study and Benchmark,Samuele Papa · Riccardo Valperga · David Knigge · Miltiadis Kofinas · Phillip Lippe · Jan-Jakob Sonke · Efstratios Gavves,https://fit-a-nef.github.io/,https://arxiv.org/abs/2312.10531,,2312.10531.pdf,How to Train Neural Field Representations: A Comprehensive Study and Benchmark,"Neural fields (NeFs) have recently emerged as a versatile method for modeling +signals of various modalities, including images, shapes, and scenes. +Subsequently, a number of works have explored the use of NeFs as +representations for downstream tasks, e.g. classifying an image based on the +parameters of a NeF that has been fit to it. However, the impact of the NeF +hyperparameters on their quality as downstream representation is scarcely +understood and remains largely unexplored. This is in part caused by the large +amount of time required to fit datasets of neural fields. + In this work, we propose $\verb|fit-a-nef|$, a JAX-based library that +leverages parallelization to enable fast optimization of large-scale NeF +datasets, resulting in a significant speed-up. With this library, we perform a +comprehensive study that investigates the effects of different hyperparameters +-- including initialization, network architecture, and optimization strategies +-- on fitting NeFs for downstream tasks. Our study provides valuable insights +on how to train NeFs and offers guidance for optimizing their effectiveness in +downstream applications. Finally, based on the proposed library and our +analysis, we propose Neural Field Arena, a benchmark consisting of neural field +variants of popular vision datasets, including MNIST, CIFAR, variants of +ImageNet, and ShapeNetv2. Our library and the Neural Field Arena will be +open-sourced to introduce standardized benchmarking and promote further +research on neural fields.",cs.CV,['cs.CV'] +Towards Memorization-Free Diffusion Models,Chen Chen · Daochang Liu · Chang Xu,https://chenchen-usyd.github.io/AMG-Project-Page/,https://arxiv.org/abs/2404.00922,,2404.00922.pdf,Towards Memorization-Free Diffusion Models,"Pretrained diffusion models and their outputs are widely accessible due to +their exceptional capacity for synthesizing high-quality images and their +open-source nature. The users, however, may face litigation risks owing to the +models' tendency to memorize and regurgitate training data during inference. To +address this, we introduce Anti-Memorization Guidance (AMG), a novel framework +employing three targeted guidance strategies for the main causes of +memorization: image and caption duplication, and highly specific user prompts. +Consequently, AMG ensures memorization-free outputs while maintaining high +image quality and text alignment, leveraging the synergy of its guidance +methods, each indispensable in its own right. 
AMG also features an innovative +automatic detection system for potential memorization during each step of +inference process, allows selective application of guidance strategies, +minimally interfering with the original sampling process to preserve output +utility. We applied AMG to pretrained Denoising Diffusion Probabilistic Models +(DDPM) and Stable Diffusion across various generation tasks. The results +demonstrate that AMG is the first approach to successfully eradicates all +instances of memorization with no or marginal impacts on image quality and +text-alignment, as evidenced by FID and CLIP scores.",cs.CV,['cs.CV'] +Gradient Alignment for Cross-domain Face Anti-Spoofing,MINH BINH LE · Simon Woo,https://github.com/Leminhbinh0209/CVPR24-FAS,https://arxiv.org/abs/2402.18817,,2402.18817.pdf,Gradient Alignment for Cross-Domain Face Anti-Spoofing,"Recent advancements in domain generalization (DG) for face anti-spoofing +(FAS) have garnered considerable attention. Traditional methods have focused on +designing learning objectives and additional modules to isolate domain-specific +features while retaining domain-invariant characteristics in their +representations. However, such approaches often lack guarantees of consistent +maintenance of domain-invariant features or the complete removal of +domain-specific features. Furthermore, most prior works of DG for FAS do not +ensure convergence to a local flat minimum, which has been shown to be +advantageous for DG. In this paper, we introduce GAC-FAS, a novel learning +objective that encourages the model to converge towards an optimal flat minimum +without necessitating additional learning modules. Unlike conventional +sharpness-aware minimizers, GAC-FAS identifies ascending points for each domain +and regulates the generalization gradient updates at these points to align +coherently with empirical risk minimization (ERM) gradient updates. This unique +approach specifically guides the model to be robust against domain shifts. We +demonstrate the efficacy of GAC-FAS through rigorous testing on challenging +cross-domain FAS datasets, where it establishes state-of-the-art performance. +The code is available at https://github.com/leminhbinh0209/CVPR24-FAS.",cs.CV,['cs.CV'] +MANUS: Markerless Grasp Capture using Articulated 3D Gaussians,Chandradeep Pokhariya · Ishaan Shah · Angela Xing · Zekun Li · Kefan Chen · Avinash Sharma · Srinath Sridhar,https://ivl.cs.brown.edu/research/manus.html,https://arxiv.org/abs/2312.02137,,2312.02137.pdf,MANUS: Markerless Grasp Capture using Articulated 3D Gaussians,"Understanding how we grasp objects with our hands has important applications +in areas like robotics and mixed reality. However, this challenging problem +requires accurate modeling of the contact between hands and objects. To capture +grasps, existing methods use skeletons, meshes, or parametric models that does +not represent hand shape accurately resulting in inaccurate contacts. We +present MANUS, a method for Markerless Hand-Object Grasp Capture using +Articulated 3D Gaussians. We build a novel articulated 3D Gaussians +representation that extends 3D Gaussian splatting for high-fidelity +representation of articulating hands. Since our representation uses Gaussian +primitives, it enables us to efficiently and accurately estimate contacts +between the hand and the object. For the most accurate results, our method +requires tens of camera views that current datasets do not provide. 
We +therefore build MANUS-Grasps, a new dataset that contains hand-object grasps +viewed from 50+ cameras across 30+ scenes, 3 subjects, and comprising over 7M +frames. In addition to extensive qualitative results, we also show that our +method outperforms others on a quantitative contact evaluation method that uses +paint transfer from the object to the hand.",cs.CV,['cs.CV'] +Language-guided Image Reflection Separation,Haofeng Zhong · Yuchen Hong · Shuchen Weng · Jinxiu Liang · Boxin Shi, ,https://arxiv.org/abs/2402.11874,,2402.11874.pdf,Language-guided Image Reflection Separation,"This paper studies the problem of language-guided reflection separation, +which aims at addressing the ill-posed reflection separation problem by +introducing language descriptions to provide layer content. We propose a +unified framework to solve this problem, which leverages the cross-attention +mechanism with contrastive learning strategies to construct the correspondence +between language descriptions and image layers. A gated network design and a +randomized training strategy are employed to tackle the recognizable layer +ambiguity. The effectiveness of the proposed method is validated by the +significant performance advantage over existing reflection separation methods +on both quantitative and qualitative comparisons.",cs.CV,['cs.CV'] +A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion,Feng Yu · Teng Zhang · Gilad Lerman, ,https://arxiv.org/abs/2404.11590,,2404.11590.pdf,A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion,"We present the subspace-constrained Tyler's estimator (STE) designed for +recovering a low-dimensional subspace within a dataset that may be highly +corrupted with outliers. STE is a fusion of the Tyler's M-estimator (TME) and a +variant of the fast median subspace. Our theoretical analysis suggests that, +under a common inlier-outlier model, STE can effectively recover the underlying +subspace, even when it contains a smaller fraction of inliers relative to other +methods in the field of robust subspace recovery. We apply STE in the context +of Structure from Motion (SfM) in two ways: for robust estimation of the +fundamental matrix and for the removal of outlying cameras, enhancing the +robustness of the SfM pipeline. Numerical experiments confirm the +state-of-the-art performance of our method in these applications. This research +makes significant contributions to the field of robust subspace recovery, +particularly in the context of computer vision and 3D reconstruction.",cs.CV,['cs.CV'] +CNC-Net: Self-Supervised Learning for CNC Machining Operations,Mohsen Yavartanoo · Sangmin Hong · Reyhaneh Neshatavar · Kyoung Mu Lee,https://github.com/myavartanoo/CNC-Net_PyTorch,https://arxiv.org/abs/2312.09925,,2312.09925.pdf,CNC-Net: Self-Supervised Learning for CNC Machining Operations,"CNC manufacturing is a process that employs computer numerical control (CNC) +machines to govern the movements of various industrial tools and machinery, +encompassing equipment ranging from grinders and lathes to mills and CNC +routers. However, the reliance on manual CNC programming has become a +bottleneck, and the requirement for expert knowledge can result in significant +costs. Therefore, we introduce a pioneering approach named CNC-Net, +representing the use of deep neural networks (DNNs) to simulate CNC machines +and grasp intricate operations when supplied with raw materials. 
CNC-Net
+constitutes a self-supervised framework that exclusively takes an input 3D
+model and subsequently generates the essential operation parameters required by
+the CNC machine to construct the object. Our method has the potential to enable
+transformative automation in manufacturing by offering a cost-effective
+alternative to the high costs of manual CNC programming while maintaining
+exceptional precision in 3D object production. Our experiments underscore the
+effectiveness of our CNC-Net in constructing the desired 3D objects through the
+utilization of CNC operations. Notably, it excels in preserving finer local
+details, exhibiting a marked enhancement in precision compared to the
+state-of-the-art 3D CAD reconstruction approaches.",cs.CV,['cs.CV']
+Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation,Xiyi Chen · Marko Mihajlovic · Shaofei Wang · Sergey Prokudin · Siyu Tang, ,https://arxiv.org/abs/2401.04728,,2401.04728.pdf,Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation,"Recent advances in generative diffusion models have enabled the previously
+unfeasible capability of generating 3D assets from a single input image or a
+text prompt. In this work, we aim to enhance the quality and functionality of
+these models for the task of creating controllable, photorealistic human
+avatars. We achieve this by integrating a 3D morphable model into the
+state-of-the-art multi-view-consistent diffusion approach. We demonstrate that
+accurate conditioning of a generative pipeline on the articulated 3D model
+enhances the baseline model performance on the task of novel view synthesis
+from a single image. More importantly, this integration facilitates a seamless
+and accurate incorporation of facial expression and body pose control into the
+generation process. To the best of our knowledge, our proposed framework is the
+first diffusion model to enable the creation of fully 3D-consistent,
+animatable, and photorealistic human avatars from a single image of an unseen
+subject; extensive quantitative and qualitative evaluations demonstrate the
+advantages of our approach over existing state-of-the-art avatar creation
+models on both novel view and novel expression synthesis tasks. The code for
+our project is publicly available.",cs.CV,"['cs.CV', 'cs.AI']"
+3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,Zhiyin Qian · Shaofei Wang · Marko Mihajlovic · Andreas Geiger · Siyu Tang, ,https://arxiv.org/abs/2312.09228,,2312.09228.pdf,3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,"We introduce an approach that creates animatable human avatars from monocular
+videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural
+radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image
+synthesis but often require days of training, and are extremely slow at
+inference time. Recently, the community has explored fast grid structures for
+efficient training of clothed avatars. Albeit being extremely fast at training,
+these methods can barely achieve an interactive rendering frame rate with
+around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a
+non-rigid deformation network to reconstruct animatable clothed human avatars
+that can be trained within 30 minutes and rendered at real-time frame rates
+(50+ FPS).
Given the explicit nature of our representation, we further +introduce as-isometric-as-possible regularizations on both the Gaussian mean +vectors and the covariance matrices, enhancing the generalization of our model +on highly articulated unseen poses. Experimental results show that our method +achieves comparable and even better performance compared to state-of-the-art +approaches on animatable avatar creation from a monocular input, while being +400x and 250x faster in training and inference, respectively.",cs.CV,['cs.CV'] +TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models,Yushi Huang · Ruihao Gong · Jing Liu · Tianlong Chen · Xianglong Liu,https://github.com/ModelTC/TFMQ-DM,https://arxiv.org/abs/2311.16503,,2311.16503.pdf,TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models,"The Diffusion model, a prevalent framework for image generation, encounters +significant challenges in terms of broad applicability due to its extended +inference times and substantial memory requirements. Efficient Post-training +Quantization (PTQ) is pivotal for addressing these issues in traditional +models. Different from traditional models, diffusion models heavily depend on +the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ +from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a +few modules totally irrespective of the sampling data. However, existing PTQ +methods do not optimize these modules separately. They adopt inappropriate +reconstruction targets and complex calibration methods, resulting in a severe +disturbance of the temporal feature and denoising trajectory, as well as a low +compression efficiency. To solve these, we propose a Temporal Feature +Maintenance Quantization (TFMQ) framework building upon a Temporal Information +Block which is just related to the time-step $t$ and unrelated to the sampling +data. Powered by the pioneering block design, we devise temporal information +aware reconstruction (TIAR) and finite set calibration (FSC) to align the +full-precision temporal features in a limited time. Equipped with the +framework, we can maintain the most temporal information and ensure the +end-to-end generation quality. Extensive experiments on various datasets and +diffusion models prove our state-of-the-art results. Remarkably, our +quantization approach, for the first time, achieves model performance nearly on +par with the full-precision model under 4-bit weight quantization. +Additionally, our method incurs almost no extra computational cost and +accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ +compared to previous works. Our code is publicly available at +https://github.com/ModelTC/TFMQ-DM.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +"Point Transformer V3: Simpler, Faster, Stronger",Xiaoyang Wu · Li Jiang · Peng-Shuai Wang · Zhijian Liu · Xihui Liu · Yu Qiao · Wanli Ouyang · Tong He · Hengshuang Zhao,https://github.com/Pointcept/PointTransformerV3,https://arxiv.org/abs/2312.10035,,2312.10035.pdf,"Point Transformer V3: Simpler, Faster, Stronger","This paper is not motivated to seek innovation within the attention +mechanism. Instead, it focuses on overcoming the existing trade-offs between +accuracy and efficiency within the context of point cloud processing, +leveraging the power of scale. Drawing inspiration from recent advances in 3D +large-scale representation learning, we recognize that model performance is +more influenced by scale than by intricate design. 
Therefore, we present Point +Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the +accuracy of certain mechanisms that are minor to the overall performance after +scaling, such as replacing the precise neighbor search by KNN with an efficient +serialized neighbor mapping of point clouds organized with specific patterns. +This principle enables significant scaling, expanding the receptive field from +16 to 1024 points while remaining efficient (a 3x increase in processing speed +and a 10x improvement in memory efficiency compared with its predecessor, +PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that +span both indoor and outdoor scenarios. Further enhanced with multi-dataset +joint training, PTv3 pushes these results to a higher level.",cs.CV,['cs.CV'] +Efficient Stitchable Task Adaptation,Haoyu He · Zizheng Pan · Jing Liu · Jianfei Cai · Bohan Zhuang, ,https://arxiv.org/abs/2311.17352,,2311.17352.pdf,Efficient Stitchable Task Adaptation,"The paradigm of pre-training and fine-tuning has laid the foundation for +deploying deep learning models. However, most fine-tuning methods are designed +to meet a specific resource budget. Recently, considering diverse deployment +scenarios with various resource budgets, stitchable neural network (SN-Net) is +introduced to quickly obtain numerous new networks (stitches) from the +pre-trained models (anchors) in a model family via model stitching. Although +promising, SN-Net confronts new challenges when adapting it to new target +domains, including huge memory and storage requirements and a long and +sub-optimal multistage adaptation process. In this work, we present a novel +framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce +a palette of fine-tuned models that adhere to diverse resource constraints. +Specifically, we first tailor parameter-efficient fine-tuning to share low-rank +updates among the stitches while maintaining independent bias terms. In this +way, we largely reduce fine-tuning memory burdens and mitigate the interference +among stitches that arises in task adaptation. Furthermore, we streamline a +simple yet effective one-stage deployment pipeline, which estimates the +important stitches to deploy with training-time gradient statistics. By +assigning higher sampling probabilities to important stitches, we also get a +boosted Pareto frontier. Extensive experiments on 25 downstream visual +recognition tasks demonstrate that our ESTA is capable of generating stitches +with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net +adaptation by remarkable margins with significantly lower training time and +fewer trainable parameters. Furthermore, we demonstrate the flexibility and +scalability of our ESTA framework by stitching LLMs from LLaMA family, +obtaining chatbot stitches of assorted sizes.",cs.LG,"['cs.LG', 'cs.CL', 'cs.CV']" +CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement,Qiang Zhu · Jinhua Hao · Yukang Ding · Yu Liu · Qiao Mo · Ming Sun · Chao Zhou · Shuyuan Zhu, ,https://arxiv.org/abs/2403.10362,,2403.10362.pdf,CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement,"Recently, numerous approaches have achieved notable success in compressed +video quality enhancement (VQE). 
However, these methods usually ignore the +utilization of valuable coding priors inherently embedded in compressed videos, +such as motion vectors and residual frames, which carry abundant temporal and +spatial information. To remedy this problem, we propose the Coding +Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial +information from coding priors. The CPGA mainly consists of an inter-frame +temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) +module. Specifically, the ITA module aggregates temporal information from +consecutive frames and coding priors, while the MNA module globally captures +spatial information guided by residual frames. In addition, to facilitate +research in VQE task, we newly construct the Video Coding Priors (VCP) dataset, +comprising 300 videos with various coding priors extracted from corresponding +bitstreams. It remedies the shortage of previous datasets on the lack of coding +information. Experimental results demonstrate the superiority of our method +compared to existing state-of-the-art methods. The code and dataset will be +released at https://github.com/CPGA/CPGA.git.",eess.IV,"['eess.IV', 'cs.CV']" +WonderJourney: Going from Anywhere to Everywhere,Hong-Xing Yu · Haoyi Duan · Junhwa Hur · Kyle Sargent · Michael Rubinstein · William Freeman · Forrester Cole · Deqing Sun · Noah Snavely · Jiajun Wu · Charles Herrmann, ,https://arxiv.org/abs/2312.03884,,,WonderJourney: Going from Anywhere to Everywhere,"We introduce WonderJourney, a modularized framework for perpetual 3D scene +generation. Unlike prior work on view generation that focuses on a single type +of scenes, we start at any user-provided location (by a text description or an +image) and generate a journey through a long sequence of diverse yet coherently +connected 3D scenes. We leverage an LLM to generate textual descriptions of the +scenes in this journey, a text-driven point cloud generation pipeline to make a +compelling and coherent sequence of 3D scenes, and a large VLM to verify the +generated scenes. We show compelling, diverse visual results across various +scene types and styles, forming imaginary ""wonderjourneys"". Project website: +https://kovenyu.com/WonderJourney/",cs.CV,"['cs.CV', 'cs.GR']" +Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge,Dongjin Kim · Sung Jin Um · Sangmin Lee · Jung Uk Kim, ,https://arxiv.org/abs/2403.17420,,2403.17420.pdf,Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge,"The goal of the multi-sound source localization task is to localize sound +sources from the mixture individually. While recent multi-sound source +localization methods have shown improved performance, they face challenges due +to their reliance on prior information about the number of objects to be +separated. In this paper, to overcome this limitation, we present a novel +multi-sound source localization method that can perform localization without +prior knowledge of the number of sound sources. To achieve this goal, we +propose an iterative object identification (IOI) module, which can recognize +sound-making objects in an iterative manner. After finding the regions of +sound-making objects, we devise object similarity-aware clustering (OSC) loss +to guide the IOI module to effectively combine regions of the same object but +also distinguish between different objects and backgrounds. 
It enables our +method to perform accurate localization of sound-making objects without any +prior knowledge. Extensive experimental results on the MUSIC and VGGSound +benchmarks show the significant performance improvements of the proposed method +over the existing methods for both single and multi-source. Our code is +available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation,Qiyuan Dai · Sibei Yang, ,,https://paperswithcode.com/paper/curriculum-point-prompting-for-weakly,,,,,nan +Osprey: Pixel Understanding with Visual Instruction Tuning,Yuqian Yuan · Wentong Li · Jian liu · Dongqi Tang · Xinjie Luo · Chi Qin · Lei Zhang · Jianke Zhu,https://github.com/CircleRadon/Osprey,https://arxiv.org/abs/2312.10032,,2312.10032.pdf,Osprey: Pixel Understanding with Visual Instruction Tuning,"Multimodal large language models (MLLMs) have recently achieved impressive +general-purpose vision-language capabilities through visual instruction tuning. +However, current MLLMs primarily focus on image-level or box-level +understanding, falling short in achieving fine-grained vision-language +alignment at pixel level. Besides, the lack of mask-based instruction data +limits their advancements. In this paper, we propose Osprey, a mask-text +instruction tuning approach, to extend MLLMs by incorporating fine-grained mask +regions into language instruction, aiming at achieving pixel-wise visual +understanding. To achieve this goal, we first meticulously curate a mask-based +region-text dataset with 724K samples, and then design a vision-language model +by injecting pixel-level representation into LLM. Specifically, Osprey adopts a +convolutional CLIP backbone as the vision encoder and employs a mask-aware +visual extractor to extract precise visual mask features from high resolution +input. Experimental results demonstrate Osprey's superiority in various region +understanding tasks, showcasing its new capability for pixel-level instruction +tuning. In particular, Osprey can be integrated with Segment Anything Model +(SAM) seamlessly to obtain multi-granularity semantics. The source code, +dataset and demo can be found at https://github.com/CircleRadon/Osprey.",cs.CV,['cs.CV'] +Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs,Hao Fei · Shengqiong Wu · Wei Ji · Hanwang Zhang · Tat-seng Chua,http://haofei.vip/Dysen-VDM/,https://arxiv.org/abs/2308.13812,,2308.13812.pdf,Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs,"Text-to-video (T2V) synthesis has gained increasing attention in the +community, in which the recently emerged diffusion models (DMs) have +promisingly shown stronger performance than the past approaches. While existing +state-of-the-art DMs are competent to achieve high-resolution video generation, +they may largely suffer from key limitations (e.g., action occurrence +disorders, crude video motions) with respect to the intricate temporal dynamics +modeling, one of the crux of video synthesis. In this work, we investigate +strengthening the awareness of video dynamics for DMs, for high-quality T2V +generation. 
Inspired by human intuition, we design an innovative dynamic scene +manager (dubbed as Dysen) module, which includes (step-1) extracting from input +text the key actions with proper time-order arrangement, (step-2) transforming +the action schedules into the dynamic scene graph (DSG) representations, and +(step-3) enriching the scenes in the DSG with sufficient and reasonable +details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via +in-context learning, Dysen realizes (nearly) human-level temporal dynamics +understanding. Finally, the resulting video DSG with rich action scene details +is encoded as fine-grained spatio-temporal features, integrated into the +backbone T2V DM for video generating. Experiments on popular T2V datasets +suggest that our Dysen-VDM consistently outperforms prior arts with significant +margins, especially in scenarios with complex actions. Codes at +https://haofei.vip/Dysen-VDM",cs.AI,"['cs.AI', 'cs.CV']" +RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D,Lingteng Qiu · Guanying Chen · Xiaodong Gu · Qi Zuo · Mutian Xu · Yushuang Wu · Weihao Yuan · Zilong Dong · Liefeng Bo · Xiaoguang Han,https://aigc3d.github.io/richdreamer/,https://arxiv.org/abs/2311.16918v1,,2311.16918v1.pdf,RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D,"Lifting 2D diffusion for 3D generation is a challenging problem due to the +lack of geometric prior and the complex entanglement of materials and lighting +in natural images. Existing methods have shown promise by first creating the +geometry through score-distillation sampling (SDS) applied to rendered surface +normals, followed by appearance modeling. However, relying on a 2D RGB +diffusion model to optimize surface normals is suboptimal due to the +distribution discrepancy between natural images and normals maps, leading to +instability in optimization. In this paper, recognizing that the normal and +depth information effectively describe scene geometry and be automatically +estimated from images, we propose to learn a generalizable Normal-Depth +diffusion model for 3D generation. We achieve this by training on the +large-scale LAION dataset together with the generalizable image-to-depth and +normal prior models. In an attempt to alleviate the mixed illumination effects +in the generated materials, we introduce an albedo diffusion model to impose +data-driven constraints on the albedo component. Our experiments show that when +integrated into existing text-to-3D pipelines, our models significantly enhance +the detail richness, achieving state-of-the-art results. Our project page is +https://lingtengqiu.github.io/RichDreamer/.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Generalizing to Unseen Domains with Few Labels,Chamuditha Jayanga Galappaththige · Sanoojan Baliah · Malitha Gunawardhana · Muhammad Haris Khan, ,https://arxiv.org/abs/2403.11674,,2403.11674.pdf,Towards Generalizing to Unseen Domains with Few Labels,"We approach the challenge of addressing semi-supervised domain generalization +(SSDG). Specifically, our aim is to obtain a model that learns +domain-generalizable features by leveraging a limited subset of labelled data +alongside a substantially larger pool of unlabeled data. Existing domain +generalization (DG) methods which are unable to exploit unlabeled data perform +poorly compared to semi-supervised learning (SSL) methods under SSDG setting. 
+Nevertheless, SSL methods have considerable room for performance improvement +when compared to fully-supervised DG training. To tackle this underexplored, +yet highly practical problem of SSDG, we make the following core contributions. +First, we propose a feature-based conformity technique that matches the +posterior distributions from the feature space with the pseudo-label from the +model's output space. Second, we develop a semantics alignment loss to learn +semantically-compatible representations by regularizing the semantic structure +in the feature space. Our method is plug-and-play and can be readily integrated +with different SSL-based SSDG baselines without introducing any additional +parameters. Extensive experimental results across five challenging DG +benchmarks with four strong SSL baselines suggest that our method provides +consistent and notable gains in two different SSDG settings.",cs.CV,['cs.CV'] +SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving,Yiming Xie · Henglu Wei · Zhenyi Liu · Xiaoyu Wang · Xiangyang Ji, ,https://arxiv.org/abs/2403.17094,,2403.17094.pdf,SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving,"To advance research in learning-based defogging algorithms, various synthetic +fog datasets have been developed. However, existing datasets created using the +Atmospheric Scattering Model (ASM) or real-time rendering engines often +struggle to produce photo-realistic foggy images that accurately mimic the +actual imaging process. This limitation hinders the effective generalization of +models from synthetic to real data. In this paper, we introduce an end-to-end +simulation pipeline designed to generate photo-realistic foggy images. This +pipeline comprehensively considers the entire physically-based foggy scene +imaging process, closely aligning with real-world image capture methods. Based +on this pipeline, we present a new synthetic fog dataset named SynFog, which +features both sky light and active lighting conditions, as well as three levels +of fog density. Experimental results demonstrate that models trained on SynFog +exhibit superior performance in visual perception and detection accuracy +compared to others when applied to real-world foggy images.",cs.CV,"['cs.CV', 'cs.LG']" +FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication,Eric Slyman · Stefan Lee · Scott Cohen · Kushal Kafle,https://ericslyman.com/fairdedup/,https://arxiv.org/abs/2404.16123,,2404.16123.pdf,FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication,"Recent dataset deduplication techniques have demonstrated that content-aware +dataset pruning can dramatically reduce the cost of training Vision-Language +Pretrained (VLP) models without significant performance losses compared to +training on the original dataset. These results have been based on pruning +commonly used image-caption datasets collected from the web -- datasets that +are known to harbor harmful social biases that may then be codified in trained +models. In this work, we evaluate how deduplication affects the prevalence of +these biases in the resulting trained models and introduce an easy-to-implement +modification to the recent SemDeDup algorithm that can reduce the negative +effects that we observe. 
When examining CLIP-style models trained on +deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm +consistently leads to improved fairness metrics over SemDeDup on the FairFace +and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'I.4.10; I.2.7; E.0']" +ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation,Khoi D Nguyen · Chen Li · Gim Hee Lee, ,https://arxiv.org/abs/2403.13647,,,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation,Jiehong Lin · lihua liu · Dekun Lu · Kui Jia, ,https://arxiv.org/abs/2311.15707,,2311.15707.pdf,SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation,"Zero-shot 6D object pose estimation involves the detection of novel objects +with their 6D poses in cluttered scenes, presenting significant challenges for +model generalizability. Fortunately, the recent Segment Anything Model (SAM) +has showcased remarkable zero-shot transfer performance, which provides a +promising solution to tackle this task. Motivated by this, we introduce SAM-6D, +a novel framework designed to realize the task through two steps, including +instance segmentation and pose estimation. Given the target objects, SAM-6D +employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) +and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D +images. ISM takes SAM as an advanced starting point to generate all possible +object proposals and selectively preserves valid ones through meticulously +crafted object matching scores in terms of semantics, appearance and geometry. +By treating pose estimation as a partial-to-partial point matching problem, PEM +performs a two-stage point matching process featuring a novel design of +background tokens to construct dense 3D-3D correspondence, ultimately yielding +the pose estimates. 
Without bells and whistles, SAM-6D outperforms the existing
+methods on the seven core datasets of the BOP Benchmark for both instance
+segmentation and pose estimation of novel objects.",cs.CV,['cs.CV']
+Test-Time Zero-Shot Temporal Action Localization,Benedetta Liberatori · Alessandro Conti · Paolo Rota · Yiming Wang · Elisa Ricci, ,https://arxiv.org/abs/2404.05426,,2404.05426.pdf,Test-Time Zero-Shot Temporal Action Localization,"Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate
+actions in untrimmed videos unseen during training. Existing ZS-TAL methods
+involve fine-tuning a model on a large amount of annotated training data. While
+effective, training-based ZS-TAL approaches assume the availability of labeled
+data for supervised learning, which can be impractical in some applications.
+Furthermore, the training process naturally induces a domain bias into the
+learned model, which may adversely affect the model's generalization ability to
+arbitrary videos. These considerations prompt us to approach the ZS-TAL problem
+from a radically novel perspective, relaxing the requirement for training data.
+To this aim, we introduce a novel method that performs Test-Time adaptation for
+Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained
+Vision and Language Model (VLM). T3AL operates in three steps. First, a
+video-level pseudo-label of the action category is computed by aggregating
+information from the entire video. Then, action localization is performed
+adopting a novel procedure inspired by self-supervised learning. Finally,
+frame-level textual descriptions extracted with a state-of-the-art captioning
+model are employed for refining the action region proposals. We validate the
+effectiveness of T3AL by conducting experiments on the THUMOS14 and the
+ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly
+outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the
+benefit of a test-time adaptation approach.",cs.CV,['cs.CV']
+De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,Yuzheng Wang · Dingkang Yang · Zhaoyu Chen · Yang Liu · Siao Liu · Wenqiang Zhang · Lihua Zhang · Lizhe Qi, ,https://arxiv.org/abs/2403.19539,,,De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,"Data-Free Knowledge Distillation (DFKD) is a promising task to train
+high-performance small models to enhance actual deployment without relying on
+the original training data. Existing methods commonly avoid relying on private
+data by utilizing synthetic or sampled data. However, a long-overlooked issue
+is the severe distribution shift between their substitution and the original
+data, which manifests as huge differences in the quality of images and class
+proportions. The harmful shifts are essentially the confounder that
+significantly causes performance bottlenecks. To tackle the issue, this paper
+proposes a novel perspective with causal inference to disentangle the student
+models from the impact of such shifts. By designing a customized causal graph,
+we first reveal the causalities among the variables in the DFKD task.
+Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI)
+framework based on the backdoor adjustment to de-confound the confounder. KDCI
+can be flexibly combined with most existing state-of-the-art baselines.
+Experiments in combination with six representative DFKD methods demonstrate the +effectiveness of our KDCI, which can obviously help existing methods under +almost all settings, \textit{e.g.}, improving the baseline by up to 15.54\% +accuracy on the CIFAR-100 dataset.",cs.CV,['cs.CV'] +DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields,Cheng-You Lu · Peisen Zhou · Angela Xing · Chandradeep Pokhariya · Arnab Dey · Ishaan Shah · Rugved Mavidipalli · Dylan Hu · Andrew Comport · Kefan Chen · Srinath Sridhar, ,https://arxiv.org/abs/2307.16897,,2307.16897.pdf,DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields,"Advances in neural fields are enabling high-fidelity capture of the shape and +appearance of dynamic 3D scenes. However, their capabilities lag behind those +offered by conventional representations such as 2D videos because of +algorithmic challenges and the lack of large-scale multi-view real-world +datasets. We address the dataset limitation with DiVa-360, a real-world 360 +dynamic visual dataset that contains synchronized high-resolution and +long-duration multi-view video sequences of table-scale scenes captured using a +customized low-cost system with 53 cameras. It contains 21 object-centric +sequences categorized by different motion types, 25 intricate hand-object +interaction sequences, and 8 long-duration sequences for a total of 17.4 M +image frames. In addition, we provide foreground-background segmentation masks, +synchronized audio, and text descriptions. We benchmark the state-of-the-art +dynamic neural field methods on DiVa-360 and provide insights about existing +methods and future challenges on long-duration neural field capture.",cs.CV,"['cs.CV', 'cs.AI']" +When StyleGAN Meets Stable Diffusion: a ${\mathcal{W}_+}$ Adapter for Personalized Image Generation,Xiaoming Li · Xinyu Hou · Chen Change Loy,https://github.com/csxmli2016/w-plus-adapter,https://arxiv.org/abs/2311.17461v1,,2311.17461v1.pdf,When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation,"Text-to-image diffusion models have remarkably excelled in producing diverse, +high-quality, and photo-realistic images. This advancement has spurred a +growing interest in incorporating specific identities into generated content. +Most current methods employ an inversion approach to embed a target visual +concept into the text embedding space using a single reference image. However, +the newly synthesized faces either closely resemble the reference image in +terms of facial attributes, such as expression, or exhibit a reduced capacity +for identity preservation. Text descriptions intended to guide the facial +attributes of the synthesized face may fall short, owing to the intricate +entanglement of identity information with identity-irrelevant facial attributes +derived from the reference image. To address these issues, we present the novel +use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve +enhanced identity preservation and disentanglement for diffusion models. By +aligning this semantically meaningful human face latent space with +text-to-image diffusion models, we succeed in maintaining high fidelity in +identity preservation, coupled with the capacity for semantic editing. +Additionally, we propose new training objectives to balance the influences of +both prompt and identity conditions, ensuring that the identity-irrelevant +background remains unaffected during facial attribute modifications. 
Extensive +experiments reveal that our method adeptly generates personalized text-to-image +outputs that are not only compatible with prompt descriptions but also amenable +to common StyleGAN editing directions in diverse settings. Our source code will +be available at \url{https://github.com/csxmli2016/w-plus-adapter}.",cs.CV,['cs.CV'] +Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset,Yujin Jeon · Eunsue Choi · Youngchan Kim · Yunseong Moon · Khalid Omer · Felix Heide · Seung-Hwan Baek, ,https://arxiv.org/abs/2311.17396,,2311.17396.pdf,Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset,"Image datasets are essential not only in validating existing methods in +computer vision but also in developing new methods. Most existing image +datasets focus on trichromatic intensity images to mimic human vision. However, +polarization and spectrum, the wave properties of light that animals in harsh +environments and with limited brain capacity often rely on, remain +underrepresented in existing datasets. Although spectro-polarimetric datasets +exist, these datasets have insufficient object diversity, limited illumination +conditions, linear-only polarization data, and inadequate image count. Here, we +introduce two spectro-polarimetric datasets: trichromatic Stokes images and +hyperspectral Stokes images. These novel datasets encompass both linear and +circular polarization; they introduce multiple spectral channels; and they +feature a broad selection of real-world scenes. With our dataset in hand, we +analyze the spectro-polarimetric image statistics, develop efficient +representations of such high-dimensional data, and evaluate spectral dependency +of shape-from-polarization methods. As such, the proposed dataset promises a +foundation for data-driven spectro-polarimetric imaging and vision research. +Dataset and code will be publicly available.",cs.CV,"['cs.CV', 'eess.IV']" +LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation,Linfeng Yuan · Miaojing Shi · Zijie Yue · Qijun Chen, ,https://arxiv.org/abs/2306.08736,,2306.08736.pdf,LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation,"Referring video object segmentation (RVOS) aims to segment the target +instance referred by a given text expression in a video clip. The text +expression normally contains sophisticated description of the instance's +appearance, action, and relation with others. It is therefore rather difficult +for a RVOS model to capture all these attributes correspondingly in the video; +in fact, the model often favours more on the action- and relation-related +visual attributes of the instance. This can end up with partial or even +incorrect mask prediction of the target instance. We tackle this problem by +taking a subject-centric short text expression from the original long text +expression. The short one retains only the appearance-related information of +the target instance so that we can use it to focus the model's attention on the +instance's appearance. We let the model make joint predictions using both long +and short text expressions; and insert a long-short cross-attention module to +interact the joint features and a long-short predictions intersection loss to +regulate the joint predictions. 
Besides the improvement on the linguistic part,
+we also introduce a forward-backward visual consistency loss, which utilizes
+optical flows to warp visual features between the annotated frames and their
+temporal neighbors for consistency. We build our method on top of two
+state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS,
+JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our
+method. Code is available at https://github.com/LinfengYuan1997/Losh.",cs.CV,['cs.CV']
+FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures,Lisa Mais · Peter Hirsch · Claire Managan · Ramya Kandarpa · Josef Rumberger · Annika Reinke · Lena Maier-Hein · Gudrun Ihrke · Dagmar Kainmueller, ,https://arxiv.org/abs/2404.00130,,2404.00130.pdf,FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures,"Instance segmentation of neurons in volumetric light microscopy images of
+nervous systems enables groundbreaking research in neuroscience by facilitating
+joint functional and morphological analyses of neural circuits at cellular
+resolution. Yet said multi-neuron light microscopy data exhibits extremely
+challenging properties for the task of instance segmentation: Individual
+neurons have long-ranging, thin filamentous and widely branching morphologies,
+multiple neurons are tightly inter-weaved, and partial volume effects, uneven
+illumination and noise inherent to light microscopy severely impede local
+disentangling as well as long-range tracing of individual neurons. These
+properties reflect a current key challenge in machine learning research, namely
+to effectively capture long-range dependencies in the data. While respective
+methodological research is buzzing, to date methods are typically benchmarked
+on synthetic datasets. To address this gap, we release the FlyLight Instance
+Segmentation Benchmark (FISBe) dataset, the first publicly available
+multi-neuron light microscopy dataset with pixel-wise annotations. In addition,
+we define a set of instance segmentation metrics for benchmarking that we
+designed to be meaningful with regard to downstream analyses. Lastly, we
+provide three baselines to kick off a competition that we envision to both
+advance the field of machine learning regarding methodology for capturing
+long-range data dependencies, and facilitate scientific discovery in basic
+neuroscience.",cs.CV,"['cs.CV', 'cs.LG']"
+Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos,Sagnik Majumder · Ziad Al-Halah · Kristen Grauman, ,https://arxiv.org/abs/2307.04760,,2307.04760.pdf,Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos,"We propose a self-supervised method for learning representations based on
+spatial audio-visual correspondences in egocentric videos. Our method uses a
+masked auto-encoding framework to synthesize masked binaural (multi-channel)
+audio through the synergy of audio and vision, thereby learning useful spatial
+relationships between the two modalities. We use our pretrained features to
+tackle two downstream video tasks requiring spatial understanding in social
+scenarios: active speaker detection and spatial audio denoising. Through
+extensive experiments, we show that our features are generic enough to improve
+over multiple state-of-the-art baselines on both tasks on two challenging
+egocentric video datasets that offer binaural audio, EgoCom and EasyCom.
+Project: http://vision.cs.utexas.edu/projects/ego_av_corr.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +Motion Blur Decomposition with Cross-shutter Guidance,Xiang Ji · Haiyang Jiang · Yinqiang Zheng,https://jixiang2016.github.io/dualBR_site/,https://arxiv.org/abs/2404.01120,,2404.01120.pdf,Motion Blur Decomposition with Cross-shutter Guidance,"Motion blur is a frequently observed image artifact, especially under +insufficient illumination where exposure time has to be prolonged so as to +collect more photons for a bright enough image. Rather than simply removing +such blurring effects, recent researches have aimed at decomposing a blurry +image into multiple sharp images with spatial and temporal coherence. Since +motion blur decomposition itself is highly ambiguous, priors from neighbouring +frames or human annotation are usually needed for motion disambiguation. In +this paper, inspired by the complementary exposure characteristics of a global +shutter (GS) camera and a rolling shutter (RS) camera, we propose to utilize +the ordered scanline-wise delay in a rolling shutter image to robustify motion +decomposition of a single blurry image. To evaluate this novel dual imaging +setting, we construct a triaxial system to collect realistic data, as well as a +deep network architecture that explicitly addresses temporal and contextual +information through reciprocal branches for cross-shutter motion blur +decomposition. Experiment results have verified the effectiveness of our +proposed algorithm, as well as the validity of our dual imaging setting.",cs.CV,['cs.CV'] +LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition,Zhonglin Sun · Chen Feng · Ioannis Patras · Georgios Tzimiropoulos, ,https://arxiv.org/abs/2403.08161,,2403.08161.pdf,LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition,"In this work we focus on learning facial representations that can be adapted +to train effective face recognition models, particularly in the absence of +labels. Firstly, compared with existing labelled face datasets, a vastly larger +magnitude of unlabeled faces exists in the real world. We explore the learning +strategy of these unlabeled facial images through self-supervised pretraining +to transfer generalized face recognition performance. Moreover, motivated by +one recent finding, that is, the face saliency area is critical for face +recognition, in contrast to utilizing random cropped blocks of images for +constructing augmentations in pretraining, we utilize patches localized by +extracted facial landmarks. This enables our method - namely LAndmark-based +Facial Self-supervised learning LAFS), to learn key representation that is more +critical for face recognition. We also incorporate two landmark-specific +augmentations which introduce more diversity of landmark information to further +regularize the learning. With learned landmark-based facial representations, we +further adapt the representation for face recognition with regularization +mitigating variations in landmark positions. 
Our method achieves significant +improvement over the state-of-the-art on multiple face recognition benchmarks, +especially on more challenging few-shot scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation,Wenxuan Wang · Tongtian Yue · Yisi Zhang · Longteng Guo · Xingjian He · Xinlong Wang · Jing Liu, ,https://arxiv.org/abs/2312.08007,,2312.08007.pdf,Unveiling Parts Beyond Objects:Towards Finer-Granularity Referring Expression Segmentation,"Referring expression segmentation (RES) aims at segmenting the foreground +masks of the entities that match the descriptive natural language expression. +Previous datasets and methods for classic RES task heavily rely on the prior +assumption that one expression must refer to object-level targets. In this +paper, we take a step further to finer-grained part-level RES task. To promote +the object-level RES task towards finer-grained vision-language understanding, +we put forward a new multi-granularity referring expression segmentation (MRES) +task and construct an evaluation benchmark called RefCOCOm by manual +annotations. By employing our automatic model-assisted data engine, we build +the largest visual grounding dataset namely MRES-32M, which comprises over +32.2M high-quality masks and captions on the provided 1M images. Besides, a +simple yet strong model named UniRES is designed to accomplish the unified +object-level and part-level grounding task. Extensive experiments on our +RefCOCOm for MRES and three datasets (i.e., RefCOCO(+/g) for classic RES task +demonstrate the superiority of our method over previous state-of-the-art +methods. To foster future research into fine-grained visual grounding, our +benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly +available at https://github.com/Rubics-Xuan/MRES",cs.CV,['cs.CV'] +Event-based Visible and Infrared Fusion via Multi-task Collaboration,Mengyue Geng · Lin Zhu · Lizhi Wang · Wei Zhang · Ruiqin Xiong · Yonghong Tian, ,https://arxiv.org/abs/2312.04328,,2312.04328.pdf,A Multi-scale Information Integration Framework for Infrared and Visible Image Fusion,"Infrared and visible image fusion aims at generating a fused image containing +the intensity and detail information of source images, and the key issue is +effectively measuring and integrating the complementary information of +multi-modality images from the same scene. Existing methods mostly adopt a +simple weight in the loss function to decide the information retention of each +modality rather than adaptively measuring complementary information for +different image pairs. In this study, we propose a multi-scale dual attention +(MDA) framework for infrared and visible image fusion, which is designed to +measure and integrate complementary information in both structure and loss +function at the image and patch level. In our method, the residual downsample +block decomposes source images into three scales first. Then, dual attention +fusion block integrates complementary information and generates a spatial and +channel attention map at each scale for feature fusion. Finally, the output +image is reconstructed by the residual reconstruction block. Loss function +consists of image-level, feature-level and patch-level three parts, of which +the calculation of the image-level and patch-level two parts are based on the +weights generated by the complementary information measurement. 
Indeed, to
+constrain the pixel intensity distribution between the output and infrared
+image, a style loss is added. Our fusion results are robust and informative
+across different scenarios. Qualitative and quantitative results on two
+datasets illustrate that our method is able to preserve both thermal radiation
+and detailed information from two modalities and achieve comparable results
+compared with the other state-of-the-art methods. Ablation experiments show the
+effectiveness of our information integration architecture and adaptively
+measure complementary information retention in the loss function.",cs.CV,['cs.CV']
+PFStorer: Personalized Face Restoration and Super-Resolution,Tuomas Varanka · Tapani Toivonen · Soumya Tripathy · Guoying Zhao · Erman Acar, ,https://arxiv.org/abs/2403.08436,,2403.08436.pdf,PFStorer: Personalized Face Restoration and Super-Resolution,"Recent developments in face restoration have achieved remarkable results in
+producing high-quality and lifelike outputs. The stunning results however often
+fail to be faithful with respect to the identity of the person as the models
+lack necessary context. In this paper, we explore the potential of personalized
+face restoration with diffusion models. In our approach a restoration model is
+personalized using a few images of the identity, leading to tailored
+restoration with respect to the identity while retaining fine-grained details.
+By using independent trainable blocks for personalization, the rich prior of a
+base restoration model can be exploited to its fullest. To avoid the model
+relying on parts of identity left in the conditioning low-quality images, a
+generative regularizer is employed. With a learnable parameter, the model
+learns to balance between the details generated based on the input image and
+the degree of personalization. Moreover, we improve the training pipeline of
+face restoration models to enable an alignment-free approach. We showcase the
+robust capabilities of our approach in several real-world scenarios with
+multiple identities, demonstrating our method's ability to generate
+fine-grained details with faithful restoration. In the user study we evaluate
+the perceptual quality and faithfulness of the generated details, with our
+method being voted best 61% of the time compared to the second best with 25% of
+the votes.",cs.CV,['cs.CV']
+UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and Unfavorable Sets,Youngju Na · Woo Jae Kim · Kyu Han · Suhyeon Ha · Sung-Eui Yoon, ,https://arxiv.org/abs/2403.05086,,2403.05086.pdf,UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Sets,"Generalizable neural implicit surface reconstruction aims to obtain an
+accurate underlying geometry given a limited number of multi-view images from
+unseen scenes. However, existing methods select only informative and relevant
+views using predefined scores for training and testing phases. This constraint
+renders the model impractical in real-world scenarios, where the availability
+of favorable combinations cannot always be ensured. We introduce and validate a
+view-combination score to indicate the effectiveness of the input view
+combination. We observe that previous methods output degenerate solutions under
+arbitrary and unfavorable sets. Building upon this finding, we propose
+UFORecon, a robust view-combination generalizable surface reconstruction
+framework.
To achieve this, we apply cross-view matching transformers to model +interactions between source images and build correlation frustums to capture +global correlations. Additionally, we explicitly encode pairwise feature +similarities as view-consistent priors. Our proposed framework significantly +outperforms previous methods in terms of view-combination generalizability and +also in the conventional generalizable protocol trained with favorable +view-combinations. The code is available at +https://github.com/Youngju-Na/UFORecon.",cs.CV,['cs.CV'] +Generalizable Face Landmarking Guided by Conditional Face Warping,Jiayi Liang · Haotian Liu · Hongteng Xu · Dixin Luo,https://plustwo0.github.io/project-face-landmarker/,https://arxiv.org/abs/2404.12322,,2404.12322.pdf,Generalizable Face Landmarking Guided by Conditional Face Warping,"As a significant step for human face modeling, editing, and generation, face +landmarking aims at extracting facial keypoints from images. A generalizable +face landmarker is required in practice because real-world facial images, e.g., +the avatars in animations and games, are often stylized in various ways. +However, achieving generalizable face landmarking is challenging due to the +diversity of facial styles and the scarcity of labeled stylized faces. In this +study, we propose a simple but effective paradigm to learn a generalizable face +landmarker based on labeled real human faces and unlabeled stylized faces. Our +method learns the face landmarker as the key module of a conditional face +warper. Given a pair of real and stylized facial images, the conditional face +warper predicts a warping field from the real face to the stylized one, in +which the face landmarker predicts the ending points of the warping field and +provides us with high-quality pseudo landmarks for the corresponding stylized +facial images. Applying an alternating optimization strategy, we learn the face +landmarker to minimize $i)$ the discrepancy between the stylized faces and the +warped real ones and $ii)$ the prediction errors of both real and pseudo +landmarks. Experiments on various datasets show that our method outperforms +existing state-of-the-art domain adaptation methods in face landmarking tasks, +leading to a face landmarker with better generalizability. Code is available at +https://plustwo0.github.io/project-face-landmarker.",cs.CV,"['cs.CV', 'cs.AI']" +PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,Michael Dorkenwald · Nimrod Barazani · Cees G. M. Snoek · Yuki Asano,https://quva-lab.github.io/PIN/,https://arxiv.org/abs/2402.08657,,2402.08657.pdf,PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,"Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown +immense potential by integrating large language models with vision systems. +Nevertheless, these models face challenges in the fundamental computer vision +task of object localisation, due to their training on multimodal data +containing mostly captions without explicit spatial grounding. While it is +possible to construct custom, supervised training pipelines with bounding box +annotations that integrate with VLMs, these result in specialized and +hard-to-scale models. In this paper, we aim to explore the limits of +caption-based VLMs and instead propose to tackle the challenge in a simpler +manner by i) keeping the weights of a caption-based VLM frozen and ii) not +using any supervised detection data. 
To this end, we introduce an +input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing +a minimal set of parameters that are slid inside the frozen VLM, unlocking +object localisation capabilities. Our PIN module is trained with a simple +next-token prediction task on synthetic data without requiring the introduction +of new output heads. Our experiments demonstrate strong zero-shot localisation +performances on a variety of images, including Pascal VOC, COCO, LVIS, and +diverse images like paintings or cartoons.",cs.CV,['cs.CV'] +SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,Keqi Chen · vinkle srivastav · Nicolas Padoy,https://github.com/CAMMA-public/SelfPose3d/,https://arxiv.org/abs/2404.02041,,,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +PEM: Prototype-based Efficient MaskFormer for Image Segmentation,Niccolò Cavagnero · Gabriele Rosi · Claudia Cuttano · Francesca Pistilli · Marco Ciccone · Giuseppe Averta · Fabio Cermelli,https://niccolocavagnero.github.io/PEM/,https://arxiv.org/abs/2402.19422,,2402.19422.pdf,PEM: Prototype-based Efficient MaskFormer for Image Segmentation,"Recent transformer-based architectures have shown impressive results in the +field of image segmentation. Thanks to their flexibility, they obtain +outstanding performance in multiple segmentation tasks, such as semantic and +panoptic, under a single unified framework. To achieve such impressive +performance, these architectures employ intensive operations and require +substantial computational resources, which are often not available, especially +on edge devices. To fill this gap, we propose Prototype-based Efficient +MaskFormer (PEM), an efficient transformer-based architecture that can operate +in multiple segmentation tasks. PEM proposes a novel prototype-based +cross-attention which leverages the redundancy of visual features to restrict +the computation and improve the efficiency without harming the performance. 
In +addition, PEM introduces an efficient multi-scale feature pyramid network, +capable of extracting features that have high semantic content in an efficient +way, thanks to the combination of deformable convolutions and context-based +self-modulation. We benchmark the proposed PEM architecture on two tasks, +semantic and panoptic segmentation, evaluated on two different datasets, +Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task +and dataset, outperforming task-specific architectures while being comparable +and even better than computationally-expensive baselines.",cs.CV,"['cs.CV', 'cs.AI']" +Improving Distant 3D Object Detection Using 2D Box Supervision,Zetong Yang · Zhiding Yu · Christopher Choy · Renhao Wang · Anima Anandkumar · Jose M. Alvarez, ,https://arxiv.org/abs/2403.09230,,2403.09230.pdf,Improving Distant 3D Object Detection Using 2D Box Supervision,"Improving the detection of distant 3d objects is an important yet challenging +task. For camera-based 3D perception, the annotation of 3d bounding relies +heavily on LiDAR for accurate depth information. As such, the distance of +annotation is often limited due to the sparsity of LiDAR points on distant +objects, which hampers the capability of existing detectors for long-range +scenarios. We address this challenge by considering only 2D box supervision for +distant objects since they are easy to annotate. We propose LR3D, a framework +that learns to recover the missing depth of distant objects. LR3D adopts an +implicit projection head to learn the generation of mapping between 2D boxes +and depth using the 3D supervision on close objects. This mapping allows the +depth estimation of distant objects conditioned on their 2D boxes, making +long-range 3D detection with 2D supervision feasible. Experiments show that +without distant 3D annotations, LR3D allows camera-based methods to detect +distant objects (over 200m) with comparable accuracy to full 3D supervision. +Our framework is general, and could widely benefit 3D detection methods to a +large extent.",cs.CV,['cs.CV'] +Visual Point Cloud Forecasting enables Scalable Autonomous Driving,Zetong Yang · Li Chen · Yanan Sun · Hongyang Li,https://github.com/OpenDriveLab/ViDAR,https://arxiv.org/abs/2312.17655,,2312.17655.pdf,Visual Point Cloud Forecasting enables Scalable Autonomous Driving,"In contrast to extensive studies on general vision, pre-training for scalable +visual autonomous driving remains seldom explored. Visual autonomous driving +applications require features encompassing semantics, 3D geometry, and temporal +information simultaneously for joint perception, prediction, and planning, +posing dramatic challenges for pre-training. To resolve this, we bring up a new +pre-training task termed as visual point cloud forecasting - predicting future +point clouds from historical visual input. The key merit of this task captures +the synergic learning of semantics, 3D structures, and temporal dynamics. Hence +it shows superiority in various downstream tasks. To cope with this new +problem, we present ViDAR, a general model to pre-train downstream visual +encoders. It first extracts historical embeddings by the encoder. These +representations are then transformed to 3D geometric space via a novel Latent +Rendering operator for future point cloud prediction. 
Experiments show +significant gain in downstream tasks, e.g., 3.1% NDS on 3D detection, ~10% +error reduction on motion forecasting, and ~15% less collision rate on +planning.",cs.CV,['cs.CV'] +Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting,Haipeng Liu · Yang Wang · Biao Qian · Meng Wang · Yong Rui,https://github.com/htyjers/StrDiffusion,https://arxiv.org/abs/2403.19898,,2403.19898.pdf,Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting,"Denoising diffusion probabilistic models for image inpainting aim to add the +noise to the texture of image during the forward process and recover masked +regions with unmasked ones of the texture via the reverse denoising process. +Despite the meaningful semantics generation, the existing arts suffer from the +semantic discrepancy between masked and unmasked regions, since the +semantically dense unmasked texture fails to be completely degraded while the +masked regions turn to the pure noise in diffusion process, leading to the +large discrepancy between them. In this paper, we aim to answer how unmasked +semantics guide texture denoising process;together with how to tackle the +semantic discrepancy, to facilitate the consistent and meaningful semantics +generation. To this end, we propose a novel structure-guided diffusion model +named StrDiffusion, to reformulate the conventional texture denoising process +under structure guidance to derive a simplified denoising objective for image +inpainting, while revealing: 1) the semantically sparse structure is beneficial +to tackle semantic discrepancy in early stage, while dense texture generates +reasonable semantics in late stage; 2) the semantics from unmasked regions +essentially offer the time-dependent structure guidance for the texture +denoising process, benefiting from the time-dependent sparsity of the structure +semantics. For the denoising process, a structure-guided neural network is +trained to estimate the simplified denoising objective by exploiting the +consistency of the denoised structure between masked and unmasked regions. +Besides, we devise an adaptive resampling strategy as a formal criterion as +whether structure is competent to guide the texture denoising process, while +regulate their semantic correlations. Extensive experiments validate the merits +of StrDiffusion over the state-of-the-arts. Our code is available at +https://github.com/htyjers/StrDiffusion.",cs.CV,['cs.CV'] +Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models,Kota Sueyoshi · Takashi Matsubara, ,https://arxiv.org/abs/2311.16117,,2311.16117.pdf,Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models,"Diffusion models have achieved remarkable results in generating high-quality, +diverse, and creative images. However, when it comes to text-based image +generation, they often fail to capture the intended meaning presented in the +text. For instance, a specified object may not be generated, an unnecessary +object may be generated, and an adjective may alter objects it was not intended +to modify. Moreover, we found that relationships indicating possession between +objects are often overlooked. While users' intentions in text are diverse, +existing methods tend to specialize in only some aspects of these. In this +paper, we propose Predicated Diffusion, a unified framework to express users' +intentions. 
We consider that the root of the above issues lies in the text +encoder, which often focuses only on individual words and neglects the logical +relationships between them. The proposed method does not solely rely on the +text encoder, but instead, represents the intended meaning in the text as +propositions using predicate logic and treats the pixels in the attention maps +as the fuzzy predicates. This enables us to obtain a differentiable loss +function that makes the image fulfill the proposition by minimizing it. When +compared to several existing methods, we demonstrated that Predicated Diffusion +can generate images that are more faithful to various text prompts, as verified +by human evaluators and pretrained image-text models.",cs.CV,['cs.CV'] +Probabilistic Human Mesh Estimation with Hypothesis Scoring,Yuan Xu · Xiaoxuan Ma · Jiajun Su · Wentao Zhu · Yu Qiao · Yizhou Wang, ,https://arxiv.org/abs/2308.02963,,2308.02963.pdf,Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models,"This work focuses on the problem of reconstructing a 3D human body mesh from +a given 2D image. Despite the inherent ambiguity of the task of human mesh +recovery, most existing works have adopted a method of regressing a single +output. In contrast, we propose a generative approach framework, called +""Diffusion-based Human Mesh Recovery (Diff-HMR)"" that takes advantage of the +denoising diffusion process to account for multiple plausible outcomes. During +the training phase, the SMPL parameters are diffused from ground-truth +parameters to random distribution, and Diff-HMR learns the reverse process of +this diffusion. In the inference phase, the model progressively refines the +given random SMPL parameters into the corresponding parameters that align with +the input image. Diff-HMR, being a generative approach, is capable of +generating diverse results for the same input image as the input noise varies. +We conduct validation experiments, and the results demonstrate that the +proposed framework effectively models the inherent ambiguity of the task of +human mesh recovery in a probabilistic manner. The code is available at +https://github.com/hanbyel0105/Diff-HMR",cs.CV,['cs.CV'] +TexVocab: Texture Vocabulary-conditioned Human Avatars,Yuxiao Liu · Zhe Li · Yebin Liu · Haoqian Wang, ,https://arxiv.org/abs/2404.00524,,2404.00524.pdf,TexVocab: Texture Vocabulary-conditioned Human Avatars,"To adequately utilize the available image evidence in multi-view video-based +avatar modeling, we propose TexVocab, a novel avatar representation that +constructs a texture vocabulary and associates body poses with texture maps for +animation. Given multi-view RGB videos, our method initially back-projects all +the available images in the training videos to the posed SMPL surface, +producing texture maps in the SMPL UV domain. Then we construct pairs of human +poses and texture maps to establish a texture vocabulary for encoding dynamic +human appearances under various poses. Unlike the commonly used joint-wise +manner, we further design a body-part-wise encoding strategy to learn the +structural effects of the kinematic chain. Given a driving pose, we query the +pose feature hierarchically by decomposing the pose vector into several body +parts and interpolating the texture features for synthesizing fine-grained +human dynamics. 
Overall, our method is able to create animatable human avatars +with detailed and dynamic appearances from RGB videos, and the experiments show +that our method outperforms state-of-the-art approaches. The project page can +be found at https://texvocab.github.io/.",cs.CV,['cs.CV'] +LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network,Hao Yang · Liyuan Pan · Yan Yang · Richard Hartley · Miaomiao Liu, ,https://arxiv.org/abs/2307.09815,,2307.09815.pdf,LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network,"Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent +blur is a challenging task.~Existing blur map-based deblurring methods have +demonstrated promising results. In this paper, we propose, to the best of our +knowledge, the first framework that introduces the contrastive language-image +pre-training framework (CLIP) to accurately estimate the blur map from a DP +pair unsupervisedly. To achieve this, we first carefully design text prompts to +enable CLIP to understand blur-related geometric prior knowledge from the DP +pair. Then, we propose a format to input a stereo DP pair to CLIP without any +fine-tuning, despite the fact that CLIP is pre-trained on monocular images. +Given the estimated blur map, we introduce a blur-prior attention block, a +blur-weighting loss, and a blur-aware loss to recover the all-in-focus image. +Our method achieves state-of-the-art performance in extensive experiments (see +Fig.~\ref{fig:teaser}).",cs.CV,['cs.CV'] +VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection,Zihua Liu · Hiroki Sakuma · Masatoshi Okutomi,http://www.ok.sc.e.titech.ac.jp/res/VSRD/index.html,https://arxiv.org/abs/2404.00149,,2404.00149.pdf,VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection,"Monocular 3D object detection poses a significant challenge in 3D scene +understanding due to its inherently ill-posed nature in monocular depth +estimation. Existing methods heavily rely on supervised learning using abundant +3D labels, typically obtained through expensive and labor-intensive annotation +on LiDAR point clouds. To tackle this problem, we propose a novel weakly +supervised 3D object detection framework named VSRD (Volumetric Silhouette +Rendering for Detection) to train 3D object detectors without any 3D +supervision but only weak 2D supervision. VSRD consists of multi-view 3D +auto-labeling and subsequent training of monocular 3D object detectors using +the pseudo labels generated in the auto-labeling stage. In the auto-labeling +stage, we represent the surface of each instance as a signed distance field +(SDF) and render its silhouette as an instance mask through our proposed +instance-aware volumetric silhouette rendering. To directly optimize the 3D +bounding boxes through rendering, we decompose the SDF of each instance into +the SDF of a cuboid and the residual distance field (RDF) that represents the +residual from the cuboid. This mechanism enables us to optimize the 3D bounding +boxes in an end-to-end manner by comparing the rendered instance masks with the +ground truth instance masks. The optimized 3D bounding boxes serve as effective +training data for 3D object detection. We conduct extensive experiments on the +KITTI-360 dataset, demonstrating that our method outperforms the existing +weakly supervised 3D object detection methods. 
The code is available at +https://github.com/skmhrk1209/VSRD.",cs.CV,['cs.CV'] +Real-World Mobile Image Denoising Dataset with Efficient Baselines,Roman Flepp · Andrey Ignatov · Radu Timofte · Luc Van Gool, ,https://arxiv.org/html/2404.08514v2,,2404.08514v2.pdf,NIR-Assisted Image Denoising: A Selective Fusion Approach and A Real-World Benchmark Datase,"Despite the significant progress in image denoising, it is still challenging +to restore fine-scale details while removing noise, especially in extremely +low-light environments. Leveraging near-infrared (NIR) images to assist visible +RGB image denoising shows the potential to address this issue, becoming a +promising technology. Nonetheless, existing works still struggle with taking +advantage of NIR information effectively for real-world image denoising, due to +the content inconsistency between NIR-RGB images and the scarcity of real-world +paired datasets. To alleviate the problem, we propose an efficient Selective +Fusion Module (SFM), which can be plug-and-played into the advanced denoising +networks to merge the deep NIR-RGB features. Specifically, we sequentially +perform the global and local modulation for NIR and RGB features, and then +integrate the two modulated features. Furthermore, we present a Real-world +NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse +scenarios as well as various noise levels. Extensive experiments on both +synthetic and our real-world datasets demonstrate that the proposed method +achieves better results than state-of-the-art ones.",cs.CV,['cs.CV'] +Exploiting Style Latent Flows for Generalizing Video Deepfake Detection,Jongwook Choi · Taehoon Kim · Yonghyun Jeong · Seungryul Baek · Jongwon Choi, ,https://arxiv.org/abs/2403.06592v1,,2403.06592v1.pdf,Exploiting Style Latent Flows for Generalizing Deepfake Detection Video Detection,"This paper presents a new approach for the detection of fake videos, based on +the analysis of style latent vectors and their abnormal behavior in temporal +changes in the generated videos. We discovered that the generated facial videos +suffer from the temporal distinctiveness in the temporal changes of style +latent vectors, which are inevitable during the generation of temporally stable +videos with various facial expressions and geometric transformations. Our +framework utilizes the StyleGRU module, trained by contrastive learning, to +represent the dynamic properties of style latent vectors. Additionally, we +introduce a style attention module that integrates StyleGRU-generated features +with content-based features, enabling the detection of visual and temporal +artifacts. We demonstrate our approach across various benchmark scenarios in +deepfake detection, showing its superiority in cross-dataset and +cross-manipulation scenarios. Through further analysis, we also validate the +importance of using temporal changes of style latent vectors to improve the +generality of deepfake video detection.",cs.CV,"['cs.CV', 'cs.AI']" +3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation,Zidu Wang · Xiangyu Zhu · Tianshuo Zhang · baiqin wang · Zhen Lei,https://github.com/wang-zidu/3DDFA-V3,https://arxiv.org/abs/2312.00311,,2312.00311.pdf,3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation,"3D Morphable Models (3DMMs) provide promising 3D face reconstructions in +various applications. 
However, existing methods struggle to reconstruct faces +with extreme expressions due to deficiencies in supervisory signals, such as +sparse or inaccurate landmarks. Segmentation information contains effective +geometric contexts for face reconstruction. Certain attempts intuitively depend +on differentiable renderers to compare the rendered silhouettes of +reconstruction with segmentation, which is prone to issues like local optima +and gradient instability. In this paper, we fully utilize the facial part +segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). +Specifically, PRDL transforms facial part segmentation into 2D points and +re-projects the reconstruction onto the image plane. Subsequently, by +introducing grid anchors and computing different statistical distances from +these anchors to the point sets, PRDL establishes geometry descriptors to +optimize the distribution of the point sets for face reconstruction. PRDL +exhibits a clear gradient compared to the renderer-based methods and presents +state-of-the-art reconstruction performance in extensive quantitative and +qualitative experiments. Our project is available at +https://github.com/wang-zidu/3DDFA-V3 .",cs.CV,['cs.CV'] +PerceptionGPT: Effectively Fusing Visual Perception into LLM,Renjie Pi · Lewei Yao · Jiahui Gao · Jipeng Zhang · Tong Zhang, ,https://arxiv.org/abs/2311.06612,,2311.06612.pdf,PerceptionGPT: Effectively Fusing Visual Perception into LLM,"The integration of visual inputs with large language models (LLMs) has led to +remarkable advancements in multi-modal capabilities, giving rise to visual +large language models (VLLMs). However, effectively harnessing VLLMs for +intricate visual perception tasks remains a challenge. In this paper, we +present a novel end-to-end framework named PerceptionGPT, which efficiently and +effectively equips the VLLMs with visual perception abilities by leveraging the +representation power of LLMs' token embedding. Our proposed method treats the +token embedding of the LLM as the carrier of spatial information, then leverage +lightweight visual task encoders and decoders to perform visual perception +tasks (e.g., detection, segmentation). Our approach significantly alleviates +the training difficulty suffered by previous approaches that formulate the +visual outputs as discrete tokens, and enables achieving superior performance +with fewer trainable parameters, less training data and shorted training time. +Moreover, as only one token embedding is required to decode the visual outputs, +the resulting sequence length during inference is significantly reduced. +Consequently, our approach enables accurate and flexible representations, +seamless integration of visual perception tasks, and efficient handling of a +multiple of visual outputs. We validate the effectiveness and efficiency of our +approach through extensive experiments. The results demonstrate significant +improvements over previous methods with much fewer trainable parameters and GPU +hours, which facilitates future research in enabling LLMs with visual +perception abilities.",cs.CV,"['cs.CV', 'cs.CL']" +In Search of a Data Transformation That Accelerates Neural Field Training,Junwon Seo · Sangyoon Lee · Kwang In Kim · Jaeho Lee, ,https://arxiv.org/abs/2311.17094,,2311.17094.pdf,In Search of a Data Transformation That Accelerates Neural Field Training,"Neural field is an emerging paradigm in data representation that trains a +neural network to approximate the given signal. 
A key obstacle that prevents +its widespread adoption is the encoding speed-generating neural fields requires +an overfitting of a neural network, which can take a significant number of SGD +steps to reach the desired fidelity level. In this paper, we delve into the +impacts of data transformations on the speed of neural field training, +specifically focusing on how permuting pixel locations affect the convergence +speed of SGD. Counterintuitively, we find that randomly permuting the pixel +locations can considerably accelerate the training. To explain this phenomenon, +we examine the neural field training through the lens of PSNR curves, loss +landscapes, and error patterns. Our analyses suggest that the random pixel +permutations remove the easy-to-fit patterns, which facilitate easy +optimization in the early stage but hinder capturing fine details of the +signal.",cs.LG,"['cs.LG', 'cs.CV']" +Multi-view Aggregation Network for Dichotomous Image Segmentation,Qian Yu · Xiaoqi Zhao · Youwei Pang · Lihe Zhang · Huchuan Lu, ,https://arxiv.org/abs/2404.07445,,2404.07445.pdf,Multi-view Aggregation Network for Dichotomous Image Segmentation,"Dichotomous Image Segmentation (DIS) has recently emerged towards +high-precision object segmentation from high-resolution natural images. + When designing an effective DIS model, the main challenge is how to balance +the semantic dispersion of high-resolution targets in the small receptive field +and the loss of high-precision details in the large receptive field. Existing +methods rely on tedious multiple encoder-decoder streams and stages to +gradually complete the global localization and local refinement. + Human visual system captures regions of interest by observing them from +multiple views. Inspired by it, we model DIS as a multi-view object perception +problem and provide a parsimonious multi-view aggregation network (MVANet), +which unifies the feature fusion of the distant view and close-up view into a +single stream with one encoder-decoder structure. With the help of the proposed +multi-view complementary localization and refinement modules, our approach +established long-range, profound visual interactions across multiple views, +allowing the features of the detailed close-up view to focus on highly slender +structures.Experiments on the popular DIS-5K dataset show that our MVANet +significantly outperforms state-of-the-art methods in both accuracy and speed. +The source code and datasets will be publicly available at +\href{https://github.com/qianyu-dlut/MVANet}{MVANet}.",cs.CV,['cs.CV'] +Three Pillars improving Vision Foundation Model Distillation for Lidar,Gilles Puy · Spyros Gidaris · Alexandre Boulch · Oriane Siméoni · Corentin Sautier · Patrick Pérez · Andrei Bursuc · Renaud Marlet,https://github.com/valeoai/ScaLR,https://arxiv.org/abs/2310.17504,,2310.17504.pdf,Three Pillars improving Vision Foundation Model Distillation for Lidar,"Self-supervised image backbones can be used to address complex 2D tasks +(e.g., semantic segmentation, object discovery) very efficiently and with +little or no downstream supervision. Ideally, 3D backbones for lidar should be +able to inherit these properties after distillation of these powerful 2D +features. The most recent methods for image-to-lidar distillation on autonomous +driving data show promising results, obtained thanks to distillation methods +that keep improving. 
Yet, we still notice a large performance gap when +measuring the quality of distilled and fully supervised features by linear +probing. In this work, instead of focusing only on the distillation method, we +study the effect of three pillars for distillation: the 3D backbone, the +pretrained 2D backbones, and the pretraining dataset. In particular, thanks to +our scalable distillation method named ScaLR, we show that scaling the 2D and +3D backbones and pretraining on diverse datasets leads to a substantial +improvement of the feature quality. This allows us to significantly reduce the +gap between the quality of distilled and fully-supervised 3D features, and to +improve the robustness of the pretrained backbones to domain gaps and +perturbations.",cs.CV,['cs.CV'] +Cloud-Device Collaborative Learning for Multimodal Large Language Models,Guanqun Wang · Jiaming Liu · Chenxuan Li · Yuan Zhang · Ma Junpeng · Xinyu Wei · Kevin Zhang · Maurice Chong · Renrui Zhang · Yijiang Liu · Shanghang Zhang,https://github.com/2644521362/Cdcca/tree/main,https://arxiv.org/abs/2312.16279,,2312.16279.pdf,Cloud-Device Collaborative Learning for Multimodal Large Language Models,"The burgeoning field of Multimodal Large Language Models (MLLMs) has +exhibited remarkable performance in diverse tasks such as captioning, +commonsense reasoning, and visual scene understanding. However, the deployment +of these large-scale MLLMs on client devices is hindered by their extensive +model parameters, leading to a notable decline in generalization capabilities +when these models are compressed for device deployment. Addressing this +challenge, we introduce a Cloud-Device Collaborative Continual Adaptation +framework, designed to enhance the performance of compressed, device-deployed +MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. +Our framework is structured into three key components: a device-to-cloud uplink +for efficient data transmission, cloud-based knowledge adaptation, and an +optimized cloud-to-device downlink for model deployment. In the uplink phase, +we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively +filter out-of-distribution tokens, thereby reducing transmission costs and +improving training efficiency. On the cloud side, we propose Adapter-based +Knowledge Distillation (AKD) method to transfer refined knowledge from +large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic +Weight update Compression (DWC) strategy for the downlink, which adaptively +selects and quantizes updated weight parameters, enhancing transmission +efficiency and reducing the representational disparity between cloud and device +models. Extensive experiments on several multimodal benchmarks demonstrate the +superiority of our proposed framework over prior Knowledge Distillation and +device-cloud collaboration methods. Notably, we also validate the feasibility +of our approach to real-world experiments.",cs.CV,['cs.CV'] +Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors,Ziqin Zhou · Hai-Ming Xu · Yangyang Shu · Lingqiao Liu, ,https://arxiv.org/abs/2404.02117,,2404.02117.pdf,Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,"Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model +to learn new classes incrementally without forgetting when only a few samples +for each class are given. 
FSCIL encounters two significant challenges: +catastrophic forgetting and overfitting, and these challenges have driven prior +studies to primarily rely on shallow models, such as ResNet-18. Even though +their limited capacity can mitigate both forgetting and overfitting issues, it +leads to inadequate knowledge transfer during few-shot incremental sessions. In +this paper, we argue that large models such as vision and language transformers +pre-trained on large datasets can be excellent few-shot incremental learners. +To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained +Vision and Language transformers with prompting functions and knowledge +distillation. Our framework effectively addresses the challenges of +catastrophic forgetting and overfitting in large models through new pre-trained +knowledge tuning (PKT) and two losses: entropy-based divergence loss and +semantic knowledge distillation loss. Experimental results show that the +proposed PriViLege significantly outperforms the existing state-of-the-art +methods with a large margin, e.g., +9.38% in CUB200, +20.58% in CIFAR-100, and ++13.36% in miniImageNet. Our implementation code is available at +https://github.com/KHU-AGI/PriViLege.",cs.CV,['cs.CV'] +LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels,Tuo Feng · Wenguan Wang · Fan Ma · Yi Yang,https://github.com/FengZicai/LSK3DNet,https://arxiv.org/abs/2403.15173,,2403.15173.pdf,LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels,"Autonomous systems need to process large-scale, sparse, and irregular point +clouds with limited compute resources. Consequently, it is essential to develop +LiDAR perception methods that are both efficient and effective. Although +naively enlarging 3D kernel size can enhance performance, it will also lead to +a cubically-increasing overhead. Therefore, it is crucial to develop +streamlined 3D large kernel designs that eliminate redundant weights and work +effectively with larger kernels. In this paper, we propose an efficient and +effective Large Sparse Kernel 3D Neural Network (LSK3DNet) that leverages +dynamic pruning to amplify the 3D kernel size. Our method comprises two core +components: Spatial-wise Dynamic Sparsity (SDS) and Channel-wise Weight +Selection (CWS). SDS dynamically prunes and regrows volumetric weights from the +beginning to learn a large sparse 3D kernel. It not only boosts performance but +also significantly reduces model size and computational cost. Moreover, CWS +selects the most important channels for 3D convolution during training and +subsequently prunes the redundant channels to accelerate inference for 3D +vision tasks. We demonstrate the effectiveness of LSK3DNet on three benchmark +datasets and five tracks compared with classical models and large kernel +designs. Notably, LSK3DNet achieves the state-of-the-art performance on +SemanticKITTI (i.e., 75.6% on single-scan and 63.4% on multi-scan), with +roughly 40% model size reduction and 60% computing operations reduction +compared to the naive large 3D kernel model.",cs.CV,['cs.CV'] +On the Robustness of Large Multimodal Models Against Image Adversarial Attacks,Xuanming Cui · Alejandro Aparcedo · Young Kyun Jang · Ser-Nam Lim, ,https://arxiv.org/abs/2312.03777,,2312.03777.pdf,On the Robustness of Large Multimodal Models Against Image Adversarial Attacks,"Recent advances in instruction tuning have led to the development of +State-of-the-Art Large Multimodal Models (LMMs). 
Given the novelty of these +models, the impact of visual adversarial attacks on LMMs has not been +thoroughly examined. We conduct a comprehensive study of the robustness of +various LMMs against different adversarial attacks, evaluated across tasks +including image classification, image captioning, and Visual Question Answer +(VQA). We find that in general LMMs are not robust to visual adversarial +inputs. However, our findings suggest that context provided to the model via +prompts, such as questions in a QA pair helps to mitigate the effects of visual +adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable +resilience to such attacks on the ScienceQA task with only an 8.10% drop in +performance compared to their visual counterparts which dropped 99.73%. We also +propose a new approach to real-world image classification which we term query +decomposition. By incorporating existence queries into our input prompt we +observe diminished attack effectiveness and improvements in image +classification accuracy. This research highlights a previously under-explored +facet of LMM robustness and sets the stage for future work aimed at +strengthening the resilience of multimodal systems in adversarial environments.",cs.CV,['cs.CV'] +Amodal Ground Truth and Completion in the Wild,Guanqi Zhan · Chuanxia Zheng · Weidi Xie · Andrew Zisserman,https://www.robots.ox.ac.uk/~vgg/research/amodal/,https://arxiv.org/abs/2312.17247,,2312.17247.pdf,Amodal Ground Truth and Completion in the Wild,"This paper studies amodal image segmentation: predicting entire object +segmentation masks including both visible and invisible (occluded) parts. In +previous work, the amodal segmentation ground truth on real images is usually +predicted by manual annotaton and thus is subjective. In contrast, we use 3D +data to establish an automatic pipeline to determine authentic ground truth +amodal masks for partially occluded objects in real images. This pipeline is +used to construct an amodal completion evaluation benchmark, MP3D-Amodal, +consisting of a variety of object categories and labels. To better handle the +amodal completion task in the wild, we explore two architecture variants: a +two-stage model that first infers the occluder, followed by amodal mask +completion; and a one-stage model that exploits the representation power of +Stable Diffusion for amodal segmentation across many categories. Without bells +and whistles, our method achieves a new state-of-the-art performance on Amodal +segmentation datasets that cover a large variety of objects, including COCOA +and our new MP3D-Amodal dataset. The dataset, model, and code are available at +https://www.robots.ox.ac.uk/~vgg/research/amodal/.",cs.CV,['cs.CV'] +MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction,Xiaolu Liu · Song Wang · Wentong Li · Ruizi Yang · Junbo Chen · Jianke Zhu,https://github.com/xiaolul2/MGMap,https://arxiv.org/abs/2404.00876,,2404.00876.pdf,MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction,"Currently, high-definition (HD) map construction leans towards a lightweight +online generation tendency, which aims to preserve timely and reliable road +scene information. However, map elements contain strong shape priors. Subtle +and sparse annotations make current detection-based frameworks ambiguous in +locating relevant feature scopes and cause the loss of detailed structures in +prediction. 
To alleviate these problems, we propose MGMap, a mask-guided +approach that effectively highlights the informative regions and achieves +precise map element localization by introducing the learned masks. +Specifically, MGMap employs learned masks based on the enhanced multi-scale BEV +features from two perspectives. At the instance level, we propose the +Mask-activated instance (MAI) decoder, which incorporates global instance and +structural information into instance queries by the activation of instance +masks. At the point level, a novel position-guided mask patch refinement +(PG-MPR) module is designed to refine point locations from a finer-grained +perspective, enabling the extraction of point-specific patch information. +Compared to the baselines, our proposed MGMap achieves a notable improvement of +around 10 mAP for different input modalities. Extensive experiments also +demonstrate that our approach showcases strong robustness and generalization +capabilities. Our code can be found at https://github.com/xiaolul2/MGMap.",cs.CV,['cs.CV'] +SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder,Dihan Zheng · Yihang Zou · Xiaowen Zhang · Chenglong Bao, ,https://arxiv.org/abs/2403.17502,,2403.17502.pdf,SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder,"The data bottleneck has emerged as a fundamental challenge in learning based +image restoration methods. Researchers have attempted to generate synthesized +training data using paired or unpaired samples to address this challenge. This +study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages +both paired and unpaired datasets to generate realistic degraded data. Our +approach is based on modeling the conditional distribution of degraded and +clean images with a specially designed graphical model. Under the variational +inference framework, we develop an objective function for handling both paired +and unpaired data. We employ our method to generate paired training samples for +real-world image denoising and super-resolution tasks. Our approach excels in +the quality of synthetic degraded images compared to other unpaired and paired +noise modeling methods. Furthermore, our approach demonstrates remarkable +performance in downstream image restoration tasks, even with limited paired +data. With more paired data, our method achieves the best performance on the +SIDD dataset.",cs.CV,['cs.CV'] +What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation,Yihua Cheng · Yaning Zhu · Zongji Wang · hongquan hao · Liu wei · Shiqing Cheng · Xi Wang · Hyung Jin Chang,https://yihua.zone/work/ivgaze/,https://arxiv.org/abs/2403.15664,,2403.15664.pdf,What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation,"Driver's eye gaze holds a wealth of cognitive and intentional cues crucial +for intelligent vehicles. Despite its significance, research on in-vehicle gaze +estimation remains limited due to the scarcity of comprehensive and +well-annotated datasets in real driving scenarios. In this paper, we present +three novel elements to advance in-vehicle gaze research. Firstly, we introduce +IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 +subjects and covering a large range of gaze and head poses within vehicles. +Conventional gaze collection systems are inadequate for in-vehicle use. 
In this +dataset, we propose a new vision-based solution for in-vehicle gaze collection, +introducing a refined gaze target calibration method to tackle annotation +challenges. Second, our research focuses on in-vehicle gaze estimation +leveraging the IVGaze. In-vehicle face images often suffer from low resolution, +prompting our introduction of a gaze pyramid transformer that leverages +transformer-based multilevel features integration. Expanding upon this, we +introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing +perspective transformation, we rotate virtual cameras to normalize images, +utilizing camera pose to merge normalized and original images for accurate gaze +estimation. GazeDPTR shows state-of-the-art performance on the IVGaze dataset. +Thirdly, we explore a novel strategy for gaze zone classification by extending +the GazeDPTR. A foundational tri-plane and project gaze onto these planes are +newly defined. Leveraging both positional features from the projection points +and visual attributes from images, we achieve superior performance compared to +relying solely on visual features, substantiating the advantage of gaze +estimation. Our project is available at https://yihua.zone/work/ivgaze.",cs.CV,['cs.CV'] +LangSplat: 3D Language Gaussian Splatting,Minghan Qin · Wanhua Li · Jiawei ZHOU · Haoqian Wang · Hanspeter Pfister,https://langsplat.github.io/,https://arxiv.org/abs/2312.16084,,2312.16084.pdf,LangSplat: 3D Language Gaussian Splatting,"Humans live in a 3D world and commonly use natural language to interact with +a 3D scene. Modeling a 3D language field to support open-ended language queries +in 3D has gained increasing attention recently. This paper introduces +LangSplat, which constructs a 3D language field that enables precise and +efficient open-vocabulary querying within 3D spaces. Unlike existing methods +that ground CLIP language embeddings in a NeRF model, LangSplat advances the +field by utilizing a collection of 3D Gaussians, each encoding language +features distilled from CLIP, to represent the language field. By employing a +tile-based splatting technique for rendering language features, we circumvent +the costly rendering process inherent in NeRF. Instead of directly learning +CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and +then learns language features on the scene-specific latent space, thereby +alleviating substantial memory demands imposed by explicit modeling. Existing +methods struggle with imprecise and vague 3D language fields, which fail to +discern clear boundaries between objects. We delve into this issue and propose +to learn hierarchical semantics using SAM, thereby eliminating the need for +extensively querying the language field across various scales and the +regularization of DINO features. Extensive experimental results show that +LangSplat significantly outperforms the previous state-of-the-art method LERF +by a large margin. Notably, LangSplat is extremely efficient, achieving a 199 +$\times$ speedup compared to LERF at the resolution of 1440 $\times$ 1080. 
We +strongly recommend readers to check out our video results at +https://langsplat.github.io/",cs.CV,['cs.CV'] +DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching,Shuzhe Wang · Juho Kannala · Daniel Barath, ,https://arxiv.org/abs/2306.12547,,2306.12547.pdf,DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching,"Matching 2D keypoints in an image to a sparse 3D point cloud of the scene +without requiring visual descriptors has garnered increased interest due to its +low memory requirements, inherent privacy preservation, and reduced need for +expensive 3D model maintenance compared to visual descriptor-based methods. +However, existing algorithms often compromise on performance, resulting in a +significant deterioration compared to their descriptor-based counterparts. In +this paper, we introduce DGC-GNN, a novel algorithm that employs a +global-to-local Graph Neural Network (GNN) that progressively exploits +geometric and color cues to represent keypoints, thereby improving matching +accuracy. Our procedure encodes both Euclidean and angular relations at a +coarse level, forming the geometric embedding to guide the point matching. We +evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not +only doubles the accuracy of the state-of-the-art visual descriptor-free +algorithm but also substantially narrows the performance gap between +descriptor-based and descriptor-free methods.",cs.CV,['cs.CV'] +DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization,Zeqin Yu · Jiangqun Ni · Yuzhen Lin · Haoyi Deng · Bin Li, ,https://arxiv.org/abs/2401.15859,,2401.15859.pdf,Diffusion Facial Forgery Detection,"Detecting diffusion-generated images has recently grown into an emerging +research area. Existing diffusion-based datasets predominantly focus on general +image generation. However, facial forgeries, which pose a more severe social +risk, have remained less explored thus far. To address this gap, this paper +introduces DiFF, a comprehensive dataset dedicated to face-focused +diffusion-generated images. DiFF comprises over 500,000 images that are +synthesized using thirteen distinct generation methods under four conditions. +In particular, this dataset leverages 30,000 carefully collected textual and +visual prompts, ensuring the synthesis of images with both high fidelity and +semantic consistency. We conduct extensive experiments on the DiFF dataset via +a human test and several representative forgery detection methods. The results +demonstrate that the binary detection accuracy of both human observers and +automated detectors often falls below 30%, shedding light on the challenges in +detecting diffusion-generated facial forgeries. 
Furthermore, we propose an edge +graph regularization approach to effectively enhance the generalization +capability of existing detectors.",cs.CV,"['cs.CV', 'cs.AI']" +Affine Equivariant Networks Based on Differential Invariants,Yikang Li · Yeqing Qiu · Yuxuan Chen · Lingshen He · Zhouchen Lin, ,,https://www.semanticscholar.org/paper/Lie-Group-Decompositions-for-Equivariant-Neural-Mironenco-Forr'e/5302620834b3969b11097f66375cadbf9ee9c817,,,,,nan +EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World,Yifei Huang · Guo Chen · Jilan Xu · Mingfang Zhang · Lijin Yang · Baoqi Pei · Hongjie Zhang · Lu Dong · Yali Wang · Limin Wang · Yu Qiao,https://github.com/OpenGVLab/EgoExoLearn,https://arxiv.org/abs/2403.16182,,2403.16182.pdf,EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World,"Being able to map the activities of others into one's own point of view is +one fundamental human skill even from a very early age. Taking a step toward +understanding this human ability, we introduce EgoExoLearn, a large-scale +dataset that emulates the human demonstration following process, in which +individuals record egocentric videos as they execute tasks guided by +demonstration videos. Focusing on the potential applications in daily +assistance and professional support, EgoExoLearn contains egocentric and +demonstration video data spanning 120 hours captured in daily life scenarios +and specialized laboratories. Along with the videos we record high-quality gaze +data and provide detailed multimodal annotations, formulating a playground for +modeling the human ability to bridge asynchronous procedural actions from +different viewpoints. To this end, we present benchmarks such as cross-view +association, cross-view action planning, and cross-view referenced skill +assessment, along with detailed analysis. We expect EgoExoLearn can serve as an +important resource for bridging the actions across views, thus paving the way +for creating AI agents capable of seamlessly learning by observing humans in +the real world. Code and data can be found at: +https://github.com/OpenGVLab/EgoExoLearn",cs.CV,['cs.CV'] +Learning from Observer Gaze: Zero-shot Attention Prediction Oriented by Human-Object Interaction Recognition,Yuchen Zhou · Linkai Liu · Chao Gou,https://yuchen2199.github.io/Interactive-Gaze/,https://arxiv.org/abs/2405.09931,,2405.09931.pdf,Learning from Observer Gaze:Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition,"Most existing attention prediction research focuses on salient instances like +humans and objects. However, the more complex interaction-oriented attention, +arising from the comprehension of interactions between instances by human +observers, remains largely unexplored. This is equally crucial for advancing +human-machine interaction and human-centered artificial intelligence. To bridge +this gap, we first collect a novel gaze fixation dataset named IG, comprising +530,000 fixation points across 740 diverse interaction categories, capturing +visual attention during human observers cognitive processes of interactions. +Subsequently, we introduce the zero-shot interaction-oriented attention +prediction task ZeroIA, which challenges models to predict visual cues for +interactions not encountered during training. Thirdly, we present the +Interactive Attention model IA, designed to emulate human observers cognitive +processes to tackle the ZeroIA problem. 
Extensive experiments demonstrate that +the proposed IA outperforms other state-of-the-art approaches in both ZeroIA +and fully supervised settings. Lastly, we endeavor to apply +interaction-oriented attention to the interaction recognition task itself. +Further experimental results demonstrate the promising potential to enhance the +performance and interpretability of existing state-of-the-art HOI models by +incorporating real human attention data from IG and attention labels generated +by IA.",cs.CV,['cs.CV'] +EFHQ: Multi-purpose ExtremePose-Face-HQ dataset,Trung Dao · Duc H Vu · Cuong Pham · Anh Tran,https://bomcon123456.github.io/efhq/,https://arxiv.org/abs/2312.17205,,2312.17205.pdf,EFHQ: Multi-purpose ExtremePose-Face-HQ dataset,"The existing facial datasets, while having plentiful images at near frontal +views, lack images with extreme head poses, leading to the downgraded +performance of deep learning models when dealing with profile or pitched faces. +This work aims to address this gap by introducing a novel dataset named Extreme +Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k +high-quality images of faces at extreme poses. To produce such a massive +dataset, we utilize a novel and meticulous dataset processing pipeline to +curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many +high-resolution face videos captured in various settings. Our dataset can +complement existing datasets on various facial-related tasks, such as facial +synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, +and face reenactment. Specifically, training with EFHQ helps models generalize +well across diverse poses, significantly improving performance in scenarios +involving extreme views, confirmed by extensive experiments. Additionally, we +utilize EFHQ to define a challenging cross-view face verification benchmark, in +which the performance of SOTA face recognition models drops 5-37% compared to +frontal-to-frontal scenarios, aiming to stimulate studies on face recognition +under severe pose conditions in the wild.",cs.CV,['cs.CV'] +Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling,Xinhang Liu · Yu-Wing Tai · Chi-Keung Tang · Pedro Miraldo · Suhas Lohit · Moitreya Chatterjee, ,https://arxiv.org/abs/2405.06214,,2405.06214.pdf,Aerial-NeRF: Adaptive Spatial Partitioning and Sampling for Large-Scale Aerial Rendering,"Recent progress in large-scale scene rendering has yielded Neural Radiance +Fields (NeRF)-based models with an impressive ability to synthesize scenes +across small objects and indoor scenes. Nevertheless, extending this idea to +large-scale aerial rendering poses two critical problems. Firstly, a single +NeRF cannot render the entire scene with high-precision for complex large-scale +aerial datasets since the sampling range along each view ray is insufficient to +cover buildings adequately. Secondly, traditional NeRFs are infeasible to train +on one GPU to enable interactive fly-throughs for modeling massive images. +Instead, existing methods typically separate the whole scene into multiple +regions and train a NeRF on each region, which are unaccustomed to different +flight trajectories and difficult to achieve fast rendering. 
To that end, we +propose Aerial-NeRF with three innovative modifications for jointly adapting +NeRF in large-scale aerial rendering: (1) Designing an adaptive spatial +partitioning and selection method based on drones' poses to adapt different +flight trajectories; (2) Using similarity of poses instead of (expert) network +for rendering speedup to determine which region a new viewpoint belongs to; (3) +Developing an adaptive sampling approach for rendering performance improvement +to cover the entire buildings at different heights. Extensive experiments have +conducted to verify the effectiveness and efficiency of Aerial-NeRF, and new +state-of-the-art results have been achieved on two public large-scale aerial +datasets and presented SCUTic dataset. Note that our model allows us to perform +rendering over 4 times as fast as compared to multiple competitors. Our +dataset, code, and model are publicly available at https://drliuqi.github.io/.",cs.CV,['cs.CV'] +PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models,Yiming Zhang · Zhening Xing · Yanhong Zeng · Youqing Fang · Kai Chen, ,https://arxiv.org/abs/2312.13964,,2312.13964.pdf,PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models,"Recent advancements in personalized text-to-image (T2I) models have +revolutionized content creation, empowering non-experts to generate stunning +images with unique styles. While promising, adding realistic motions into these +personalized images by text poses significant challenges in preserving distinct +styles, high-fidelity details, and achieving motion controllability by text. In +this paper, we present PIA, a Personalized Image Animator that excels in +aligning with condition images, achieving motion controllability by text, and +the compatibility with various personalized T2I models without specific tuning. +To achieve these goals, PIA builds upon a base T2I model with well-trained +temporal alignment layers, allowing for the seamless transformation of any +personalized T2I model into an image animation model. A key component of PIA is +the introduction of the condition module, which utilizes the condition frame +and inter-frame affinity as input to transfer appearance information guided by +the affinity hint for individual frame synthesis in the latent space. This +design mitigates the challenges of appearance-related image alignment within +and allows for a stronger focus on aligning with motion-related guidance.",cs.CV,"['cs.CV', 'cs.AI']" +Weakly Supervised Video Individual Counting,Xinyan Liu · Guorong Li · Yuankai Qi · Ziheng Yan · Zhenjun Han · Anton van den Hengel · Ming-Hsuan Yang · Qingming Huang, ,https://arxiv.org/abs/2312.05923,,2312.05923.pdf,Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting,"Video Individual Counting (VIC) aims to predict the number of unique +individuals in a single video. % Existing methods learn representations based +on trajectory labels for individuals, which are annotation-expensive. % To +provide a more realistic reflection of the underlying practical challenge, we +introduce a weakly supervised VIC task, wherein trajectory labels are not +provided. Instead, two types of labels are provided to indicate traffic +entering the field of view (inflow) and leaving the field view (outflow). % We +also propose the first solution as a baseline that formulates the task as a +weakly supervised contrastive learning problem under group-level matching. 
In +doing so, we devise an end-to-end trainable soft contrastive loss to drive the +network to distinguish inflow, outflow, and the remaining. % To facilitate +future study in this direction, we generate annotations from the existing VIC +datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. % Extensive +results show that our baseline weakly supervised method outperforms supervised +methods, and thus, little information is lost in the transition to the more +practically relevant weakly supervised task. The code and trained model will be +public at \href{https://github.com/streamer-AP/CGNet}{CGNet}",cs.CV,['cs.CV'] +Model Inversion Robustness: Can Transfer Learning Help?,Sy-Tuyen Ho · Koh Jun Hao · Keshigeyan Chandrasegaran · Ngoc-Bao Nguyen · Ngai-Man Cheung, ,https://arxiv.org/abs/2405.05588,,2405.05588.pdf,Model Inversion Robustness: Can Transfer Learning Help?,"Model Inversion (MI) attacks aim to reconstruct private training data by +abusing access to machine learning models. Contemporary MI attacks have +achieved impressive attack performance, posing serious threats to privacy. +Meanwhile, all existing MI defense methods rely on regularization that is in +direct conflict with the training objective, resulting in noticeable +degradation in model utility. In this work, we take a different perspective, +and propose a novel and simple Transfer Learning-based Defense against Model +Inversion (TL-DMI) to render MI-robust models. Particularly, by leveraging TL, +we limit the number of layers encoding sensitive information from private +training dataset, thereby degrading the performance of MI attack. We conduct an +analysis using Fisher Information to justify our method. Our defense is +remarkably simple to implement. Without bells and whistles, we show in +extensive experiments that TL-DMI achieves state-of-the-art (SOTA) MI +robustness. Our code, pre-trained models, demo and inverted data are available +at: https://hosytuyen.github.io/projects/TL-DMI",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +$M^3$-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection,Bin Pu · Liwen Wang · Jiewen Yang · He Guannan · Xingbo Dong · Shengli Li · Ying Tan · Ming Chen · Zhe Jin · Kenli Li · Xiaomeng Li, ,https://arxiv.org/abs/2310.14172,,2310.14172.pdf,ASC: Appearance and Structure Consistency for Unsupervised Domain Adaptation in Fetal Brain MRI Segmentation,"Automatic tissue segmentation of fetal brain images is essential for the +quantitative analysis of prenatal neurodevelopment. However, producing +voxel-level annotations of fetal brain imaging is time-consuming and expensive. +To reduce labeling costs, we propose a practical unsupervised domain adaptation +(UDA) setting that adapts the segmentation labels of high-quality fetal brain +atlases to unlabeled fetal brain MRI data from another domain. To address the +task, we propose a new UDA framework based on Appearance and Structure +Consistency, named ASC. We adapt the segmentation model to the appearances of +different domains by constraining the consistency before and after a +frequency-based image transformation, which is to swap the appearance between +brain MRI data and atlases. Consider that even in the same domain, the fetal +brain images of different gestational ages could have significant variations in +the anatomical structures. To make the model adapt to the structural variations +in the target domain, we further encourage prediction consistency under +different structural perturbations. 
Extensive experiments on FeTA 2021 +benchmark demonstrate the effectiveness of our ASC in comparison to +registration-based, semi-supervised learning-based, and existing UDA-based +methods.",eess.IV,"['eess.IV', 'cs.CV']" +A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?,Galadrielle Humblot-Renaux · Sergio Escalera · Thomas B. Moeslund, ,https://arxiv.org/abs/2404.01775,,2404.01775.pdf,A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?,"The ability to detect unfamiliar or unexpected images is essential for safe +deployment of computer vision systems. In the context of classification, the +task of detecting images outside of a model's training domain is known as +out-of-distribution (OOD) detection. While there has been a growing research +interest in developing post-hoc OOD detection methods, there has been +comparably little discussion around how these methods perform when the +underlying classifier is not trained on a clean, carefully curated dataset. In +this work, we take a closer look at 20 state-of-the-art OOD detection methods +in the (more realistic) scenario where the labels used to train the underlying +classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive +experiments across different datasets, noise types & levels, architectures and +checkpointing strategies provide insights into the effect of class label noise +on OOD detection, and show that poor separation between incorrectly classified +ID samples vs. OOD samples is an overlooked yet important limitation of +existing methods. Code: https://github.com/glhr/ood-labelnoise",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models,Mengcheng Li · Hongwen Zhang · Yuxiang Zhang · Ruizhi Shao · Tao Yu · Yebin Liu,https://www.liuyebin.com/HHMR/HHMR.html,https://arxiv.org/abs/2402.14654,,2402.14654.pdf,Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot,"We present Multi-HMR, a strong single-shot model for multi-person 3D human +mesh recovery from a single RGB image. Predictions encompass the whole body, +i.e, including hands and facial expressions, using the SMPL-X parametric model +and spatial location in the camera coordinate system. Our model detects people +by predicting coarse 2D heatmaps of person centers, using features produced by +a standard Vision Transformer (ViT) backbone. It then predicts their whole-body +pose, shape and spatial location using a new cross-attention module called the +Human Prediction Head (HPH), with one query per detected center token, +attending to the entire set of features. As direct prediction of SMPL-X +parameters yields suboptimal results, we introduce CUFFS; the Close-Up Frames +of Full-Body Subjects dataset, containing humans close to the camera with +diverse hand poses. We show that incorporating this dataset into training +further enhances predictions, particularly for hands, enabling us to achieve +state-of-the-art performance. Multi-HMR also optionally accounts for camera +intrinsics, if available, by encoding camera ray directions for each image +token. This simple design achieves strong performance on whole-body and +body-only benchmarks simultaneously. We train models with various backbone +sizes and input resolutions. 
In particular, using a ViT-S backbone and +$448\times448$ input images already yields a fast and competitive model with +respect to state-of-the-art methods, while considering larger models and higher +resolutions further improve performance.",cs.CV,['cs.CV'] +C$^\text{2}$RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction,Yiqun Lin · Jiewen Yang · hualiang wang · Xinpeng Ding · Wei Zhao · Xiaomeng Li,https://github.com/xmed-lab/C2RV-CBCT,https://arxiv.org/abs/2312.01689,,2312.01689.pdf,Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization,"Cone beam computed tomography (CBCT) is an emerging medical imaging technique +to visualize the internal anatomical structures of patients. During a CBCT +scan, several projection images of different angles or views are collectively +utilized to reconstruct a tomographic image. However, reducing the number of +projections in a CBCT scan while preserving the quality of a reconstructed +image is challenging due to the nature of an ill-posed inverse problem. +Recently, a neural attenuation field (NAF) method was proposed by adopting a +neural radiance field algorithm as a new way for CBCT reconstruction, +demonstrating fast and promising results using only 50 views. However, +decreasing the number of projections is still preferable to reduce potential +radiation exposure, and a faster reconstruction time is required considering a +typical scan time. In this work, we propose a fast and accurate sparse-view +CBCT reconstruction (FACT) method to provide better reconstruction quality and +faster optimization speed in the minimal number of view acquisitions ($<$ 50 +views). In the FACT method, we meta-trained a neural network and a hash-encoder +using a few scans (= 15), and a new regularization technique is utilized to +reconstruct the details of an anatomical structure. In conclusion, we have +shown that the FACT method produced better, and faster reconstruction results +over the other conventional algorithms based on CBCT scans of different body +parts (chest, head, and abdomen) and CT vendors (Siemens, Phillips, and GE).",eess.IV,"['eess.IV', 'cs.CV']" +Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions,Namitha Padmanabhan · Matthew A Gwilliam · Pulkit Kumar · Shishira R Maiya · Max Ehrlich · Abhinav Shrivastava,https://namithap10.github.io/xinc/,https://arxiv.org/abs/2401.10217,,2401.10217.pdf,Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions,"The many variations of Implicit Neural Representations (INRs), where a neural +network is trained as a continuous representation of a signal, have tremendous +practical utility for downstream tasks including novel view synthesis, video +compression, and image superresolution. Unfortunately, the inner workings of +these networks are seriously under-studied. Our work, eXplaining the Implicit +Neural Canvas (XINC), is a unified framework for explaining properties of INRs +by examining the strength of each neuron's contribution to each output pixel. +We call the aggregate of these contribution maps the Implicit Neural Canvas and +we use this concept to demonstrate that the INRs which we study learn to +''see'' the frames they represent in surprising ways. For example, INRs tend to +have highly distributed representations. 
While lacking high-level object +semantics, they have a significant bias for color and edges, and are almost +entirely space-agnostic. We arrive at our conclusions by examining how objects +are represented across time in video INRs, using clustering to visualize +similar neurons across layers and architectures, and show that this is +dominated by motion. These insights demonstrate the general usefulness of our +analysis framework. Our project page is available at +https://namithap10.github.io/xinc.",cs.CV,['cs.CV'] +Posterior Distillation Sampling,Juil Koo · Chanho Park · Minhyuk Sung,https://posterior-distillation-sampling.github.io/,https://arxiv.org/abs/2311.13831,,2311.13831.pdf,Posterior Distillation Sampling,"We introduce Posterior Distillation Sampling (PDS), a novel optimization +method for parametric image editing based on diffusion models. Existing +optimization-based methods, which leverage the powerful 2D prior of diffusion +models to handle various parametric images, have mainly focused on generation. +Unlike generation, editing requires a balance between conforming to the target +attribute and preserving the identity of the source content. Recent 2D image +editing methods have achieved this balance by leveraging the stochastic latent +encoded in the generative process of diffusion models. To extend the editing +capabilities of diffusion models shown in pixel space to parameter space, we +reformulate the 2D image editing method into an optimization form named PDS. +PDS matches the stochastic latents of the source and the target, enabling the +sampling of targets in diverse parameter spaces that align with a desired +attribute while maintaining the source's identity. We demonstrate that this +optimization resembles running a generative process with the target attribute, +but aligning this process with the trajectory of the source's generative +process. Extensive editing results in Neural Radiance Fields and Scalable +Vector Graphics representations demonstrate that PDS is capable of sampling +targets to fulfill the aforementioned balance across various parameter spaces.",cs.CV,['cs.CV'] +Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices,Huancheng Chen · Haris Vikalo, ,https://arxiv.org/abs/2311.18129,,2311.18129.pdf,Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices,"While federated learning (FL) systems often utilize quantization to battle +communication and computational bottlenecks, they have heretofore been limited +to deploying fixed-precision quantization schemes. Meanwhile, the concept of +mixed-precision quantization (MPQ), where different layers of a deep learning +model are assigned varying bit-width, remains unexplored in the FL settings. We +present a novel FL algorithm, FedMPQ, which introduces mixed-precision +quantization to resource-heterogeneous FL systems. Specifically, local models, +quantized so as to satisfy bit-width constraint, are trained by optimizing an +objective function that includes a regularization term which promotes reduction +of precision in some of the layers without significant performance degradation. +The server collects local model updates, de-quantizes them into full-precision +models, and then aggregates them into a global model. To initialize the next +round of local training, the server relies on the information learned in the +previous training round to customize bit-width assignments of the models +delivered to different clients. 
In extensive benchmarking experiments on +several model architectures and different datasets in both iid and non-iid +settings, FedMPQ outperformed the baseline FL schemes that utilize +fixed-precision quantization while incurring only a minor computational +overhead on the participating devices.",cs.LG,"['cs.LG', 'cs.DC']" +Coherent Temporal Synthesis for Incremental Action Segmentation,Guodong Ding · Hans Golong · Angela Yao,https://guodongding.cn/projects/itas/itas.html,https://arxiv.org/abs/2403.06102,,2403.06102.pdf,Coherent Temporal Synthesis for Incremental Action Segmentation,"Data replay is a successful incremental learning technique for images. It +prevents catastrophic forgetting by keeping a reservoir of previous data, +original or synthesized, to ensure the model retains past knowledge while +adapting to novel concepts. However, its application in the video domain is +rudimentary, as it simply stores frame exemplars for action recognition. This +paper presents the first exploration of video data replay techniques for +incremental action segmentation, focusing on action temporal modeling. We +propose a Temporally Coherent Action (TCA) model, which represents actions +using a generative model instead of storing individual frames. The integration +of a conditioning variable that captures temporal coherence allows our model to +understand the evolution of action features over time. Therefore, action +segments generated by TCA for replay are diverse and temporally coherent. In a +10-task incremental setup on the Breakfast dataset, our approach achieves +significant increases in accuracy for up to 22% compared to the baselines.",cs.CV,['cs.CV'] +GLACE: Global Local Accelerated Coordinate Encoding,Fangjinhua Wang · Xudong Jiang · Silvano Galliani · Christoph Vogel · Marc Pollefeys, ,,https://ieeexplore.ieee.org/document/10204902/figures,,,,,nan +Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation,Mingyu Lee · Jongwon Choi,https://github.com/MingyuLee82/TGI_AD_v1,https://arxiv.org/abs/2403.06247,,2403.06247.pdf,Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation,"We propose a text-guided variational image generation method to address the +challenge of getting clean data for anomaly detection in industrial +manufacturing. Our method utilizes text information about the target object, +learned from extensive text library documents, to generate non-defective data +images resembling the input image. The proposed framework ensures that the +generated non-defective images align with anticipated distributions derived +from textual and image-based knowledge, ensuring stability and generality. +Experimental results demonstrate the effectiveness of our approach, surpassing +previous methods even with limited non-defective data. Our approach is +validated through generalization tests across four baseline models and three +distinct datasets. 
We present an additional analysis to enhance the +effectiveness of anomaly detection models by utilizing the generated images.",cs.CV,"['cs.CV', 'cs.AI']" +Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models,Bin Fu · Fanghua Yu · Anran Liu · Zixuan Wang · Jie Wen · Junjun He · Yu Qiao, ,https://arxiv.org/abs/2312.12142,,2312.12142.pdf,FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning,"Automatic font generation is an imitation task, which aims to create a font +library that mimics the style of reference images while preserving the content +from source images. Although existing font generation methods have achieved +satisfactory performance, they still struggle with complex characters and large +style variations. To address these issues, we propose FontDiffuser, a +diffusion-based image-to-image one-shot font generation method, which +innovatively models the font imitation task as a noise-to-denoise paradigm. In +our method, we introduce a Multi-scale Content Aggregation (MCA) block, which +effectively combines global and local content cues across different scales, +leading to enhanced preservation of intricate strokes of complex characters. +Moreover, to better manage the large variations in style transfer, we propose a +Style Contrastive Refinement (SCR) module, which is a novel structure for style +representation learning. It utilizes a style extractor to disentangle styles +from images, subsequently supervising the diffusion model via a meticulously +designed style contrastive loss. Extensive experiments demonstrate +FontDiffuser's state-of-the-art performance in generating diverse characters +and styles. It consistently excels on complex characters and large style +changes compared to previous methods. The code is available at +https://github.com/yeungchenwa/FontDiffuser.",cs.CV,"['cs.CV', 'cs.AI']" +PTQ4SAM: Post-Training Quantization for Segment Anything,Chengtao Lv · Hong Chen · Jinyang Guo · Yifu Ding · Xianglong Liu, ,https://arxiv.org/abs/2405.03144,,2405.03144.pdf,PTQ4SAM: Post-Training Quantization for Segment Anything,"Segment Anything Model (SAM) has achieved impressive performance in many +computer vision tasks. However, as a large-scale model, the immense memory and +computation costs hinder its practical deployment. In this paper, we propose a +post-training quantization (PTQ) framework for Segment Anything Model, namely +PTQ4SAM. First, we investigate the inherent bottleneck of SAM quantization +attributed to the bimodal distribution in post-Key-Linear activations. We +analyze its characteristics from both per-tensor and per-channel perspectives, +and propose a Bimodal Integration strategy, which utilizes a mathematically +equivalent sign operation to transform the bimodal distribution into a +relatively easy-quantized normal distribution offline. Second, SAM encompasses +diverse attention mechanisms (i.e., self-attention and two-way +cross-attention), resulting in substantial variations in the post-Softmax +distributions. Therefore, we introduce an Adaptive Granularity Quantization for +Softmax through searching the optimal power-of-two base, which is +hardware-friendly. Extensive experimental results across various vision tasks +(instance segmentation, semantic segmentation and object detection), datasets +and model variants show the superiority of PTQ4SAM. 
For example, when +quantizing SAM-L to 6-bit, we achieve lossless accuracy for instance +segmentation, about 0.5\% drop with theoretical 3.9$\times$ acceleration. The +code is available at \url{https://github.com/chengtao-lv/PTQ4SAM}.",cs.CV,"['cs.CV', 'cs.LG']" +Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach,Mir Hossain Hossain · Mennatullah Siam · Leonid Sigal · Jim Little, ,https://arxiv.org/abs/2404.11732,,2404.11732.pdf,Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach,"The emergence of attention-based transformer models has led to their +extensive use in various tasks, due to their superior generalization and +transfer properties. Recent research has demonstrated that such models, when +prompted appropriately, are excellent for few-shot inference. However, such +techniques are under-explored for dense prediction tasks like semantic +segmentation. In this work, we examine the effectiveness of prompting a +transformer-decoder with learned visual prompts for the generalized few-shot +segmentation (GFSS) task. Our goal is to achieve strong performance not only on +novel categories with limited examples, but also to retain performance on base +categories. We propose an approach to learn visual prompts with limited +examples. These learned visual prompts are used to prompt a multiscale +transformer decoder to facilitate accurate dense predictions. Additionally, we +introduce a unidirectional causal attention mechanism between the novel +prompts, learned with limited examples, and the base prompts, learned with +abundant data. This mechanism enriches the novel prompts without deteriorating +the base class performance. Overall, this form of prompting helps us achieve +state-of-the-art performance for GFSS on two different benchmark datasets: +COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or +transduction). Furthermore, test-time optimization leveraging unlabelled test +data can be used to improve the prompts, which we refer to as transductive +prompt tuning.",cs.CV,['cs.CV'] +Precise Image Editing via Recognition and Generation Tasks,Shelly Sheynin · Adam Polyak · Uriel Singer · Yuval Kirstain · Amit Zohar · Oron Ashual · Devi Parikh · Yaniv Taigman,https://emu-edit.metademolab.com/,https://arxiv.org/abs/2311.10089,,2311.10089.pdf,Emu Edit: Precise Image Editing via Recognition and Generation Tasks,"Instruction-based image editing holds immense potential for a variety of +applications, as it enables users to perform any editing operation using a +natural language instruction. However, current models in this domain often +struggle with accurately executing user instructions. We present Emu Edit, a +multi-task image editing model which sets state-of-the-art results in +instruction-based image editing. To develop Emu Edit we train it to multi-task +across an unprecedented range of tasks, such as region-based editing, free-form +editing, and Computer Vision tasks, all of which are formulated as generative +tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we +provide it with learned task embeddings which guide the generation process +towards the correct edit type. Both these elements are essential for Emu Edit's +outstanding performance. Furthermore, we show that Emu Edit can generalize to +new tasks, such as image inpainting, super-resolution, and compositions of +editing tasks, with just a few labeled examples. 
This capability offers a +significant advantage in scenarios where high-quality samples are scarce. +Lastly, to facilitate a more rigorous and informed assessment of instructable +image editing models, we release a new challenging and versatile benchmark that +includes seven different image editing tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +UniVS: Unified and Universal Video Segmentation with Prompts as Queries,Minghan LI · Shuai Li · Xindong Zhang · Lei Zhang, ,https://arxiv.org/abs/2402.18115,,2402.18115.pdf,UniVS: Unified and Universal Video Segmentation with Prompts as Queries,"Despite the recent advances in unified image segmentation (IS), developing a +unified video segmentation (VS) model remains a challenge. This is mainly +because generic category-specified VS tasks need to detect all objects and +track them across consecutive frames, while prompt-guided VS tasks require +re-identifying the target with visual/text prompts throughout the entire video, +making it hard to handle the different tasks with the same architecture. We +make an attempt to address these issues and present a novel unified VS +architecture, namely UniVS, by using prompts as queries. UniVS averages the +prompt features of the target from previous frames as its initial query to +explicitly decode masks, and introduces a target-wise prompt cross-attention +layer in the mask decoder to integrate prompt features in the memory pool. By +taking the predicted masks of entities from previous frames as their visual +prompts, UniVS converts different VS tasks into prompt-guided target +segmentation, eliminating the heuristic inter-frame matching process. Our +framework not only unifies the different VS tasks but also naturally achieves +universal training and testing, ensuring robust performance across different +scenarios. UniVS shows a commendable balance between performance and +universality on 10 challenging VS benchmarks, covering video instance, +semantic, panoptic, object, and referring segmentation tasks. Code can be found +at \url{https://github.com/MinghanLi/UniVS}.",cs.CV,"['cs.CV', 'cs.CL']" +A-Teacher: Asymmetric Network for 3D Semi-Supervised Object Detection,Hanshi Wang · Zhipeng Zhang · Jin Gao · Weiming Hu, ,https://arxiv.org/abs/2401.05011,,2401.05011.pdf,Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection,"Semi-supervised 3D object detection is a promising yet under-explored +direction to reduce data annotation costs, especially for cluttered indoor +scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this +task by utilizing a teacher model to generate pseudo-labels for unlabeled +samples. However, the availability of unlabeled samples in the 3D domain is +relatively limited compared to its 2D counterpart due to the greater effort +required to collect 3D data. Moreover, the loose consistency regularization in +SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to +either low-quality supervision or a limited amount of pseudo labels. To address +these issues, we present a novel Dual-Perspective Knowledge Enrichment approach +named DPKE for semi-supervised 3D object detection. Our DPKE enriches the +knowledge of limited training data, particularly unlabeled data, from two +perspectives: data-perspective and feature-perspective. 
Specifically, from the +data-perspective, we propose a class-probabilistic data augmentation method +that augments the input data with additional instances based on the varying +distribution of class probabilities. Our DPKE achieves feature-perspective +knowledge enrichment by designing a geometry-aware feature matching method that +regularizes feature-level similarity between object proposals from the student +and teacher models. Extensive experiments on the two benchmark datasets +demonstrate that our DPKE achieves superior performance over existing +state-of-the-art approaches under various label ratio conditions. The source +code will be made available to the public.",cs.CV,['cs.CV'] +MRFS: Mutually Reinforcing Image Fusion and Segmentation,HAO ZHANG · Xuhui Zuo · Jie Jiang · Chunchao Guo · Jiayi Ma, ,,https://ojs.aaai.org/index.php/AAAI/article/view/28536,,,,,nan +OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning,Haiyang Ying · Yixuan Yin · Jinzhi Zhang · Fan Wang · Tao Yu · Ruqi Huang · Lu Fang,https://oceanying.github.io/OmniSeg3D/,https://arxiv.org/abs/2311.11666,,2311.11666.pdf,OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning,"Towards holistic understanding of 3D scenes, a general 3D segmentation method +is needed that can segment diverse objects without restrictions on object +quantity or categories, while also reflecting the inherent hierarchical +structure. To achieve this, we propose OmniSeg3D, an omniversal segmentation +method aims for segmenting anything in 3D all at once. The key insight is to +lift multi-view inconsistent 2D segmentations into a consistent 3D feature +field through a hierarchical contrastive learning framework, which is +accomplished by two steps. Firstly, we design a novel hierarchical +representation based on category-agnostic 2D segmentations to model the +multi-level relationship among pixels. Secondly, image features rendered from +the 3D feature field are clustered at different levels, which can be further +drawn closer or pushed apart according to the hierarchical relationship between +different levels. In tackling the challenges posed by inconsistent 2D +segmentations, this framework yields a global consistent 3D feature field, +which further enables hierarchical segmentation, multi-object selection, and +global discretization. Extensive experiments demonstrate the effectiveness of +our method on high-quality 3D segmentation and accurate hierarchical structure +understanding. A graphical user interface further facilitates flexible +interaction for omniversal 3D segmentation.",cs.CV,['cs.CV'] +Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance,Kelvin C.K. Chan · Yang Zhao · Xuhui Jia · Ming-Hsuan Yang · Huisheng Wang, ,https://arxiv.org/abs/2405.01356,,2405.01356.pdf,Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance,"In subject-driven text-to-image synthesis, the synthesis process tends to be +heavily influenced by the reference images provided by users, often overlooking +crucial attributes detailed in the text prompt. In this work, we propose +Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the +problem. We show that through constructing a subject-agnostic condition and +applying our proposed dual classifier-free guidance, one could obtain outputs +consistent with both the given subject and input text prompts. We validate the +efficacy of our approach through both optimization-based and encoder-based +methods. 
Additionally, we demonstrate its applicability in second-order +customization methods, where an encoder-based model is fine-tuned with +DreamBooth. Our approach is conceptually simple and requires only minimal code +modifications, but leads to substantial quality improvements, as evidenced by +our evaluations and user studies.",cs.CV,['cs.CV'] +DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement,Hao Wu · Huabin Liu · Yu Qiao · Xiao Sun, ,https://arxiv.org/abs/2404.02755,,2404.02755.pdf,DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement,"We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for +dense video captioning (DVC), that elaborates on improving the quality of the +generated event captions and their associated pseudo event boundaries from +unlabeled videos. By leveraging the capabilities of diverse large language +models (LLMs), we generate rich DVC-oriented caption candidates and optimize +the corresponding pseudo boundaries under several meticulously designed +objectives, considering diversity, event-centricity, temporal ordering, and +coherence. Moreover, we further introduce a novel online boundary refinement +strategy that iteratively improves the quality of pseudo boundaries during +training. Comprehensive experiments have been conducted to examine the +effectiveness of the proposed technique components. By leveraging a substantial +amount of unlabeled video data, such as HowTo100M, we achieve a remarkable +advancement on standard DVC datasets like YouCook2 and ActivityNet. We +outperform the previous state-of-the-art Vid2Seq across a majority of metrics, +achieving this with just 0.4% of the unlabeled video data used for pre-training +by Vid2Seq.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution,Cheeun Hong · Kyoung Mu Lee, ,https://arxiv.org/abs/2404.03296,,2404.03296.pdf,AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution,"Although image super-resolution (SR) problem has experienced unprecedented +restoration accuracy with deep neural networks, it has yet limited versatile +applications due to the substantial computational costs. Since different input +images for SR face different restoration difficulties, adapting computational +costs based on the input image, referred to as adaptive inference, has emerged +as a promising solution to compress SR networks. Specifically, adapting the +quantization bit-widths has successfully reduced the inference and memory cost +without sacrificing the accuracy. However, despite the benefits of the +resultant adaptive network, existing works rely on time-intensive +quantization-aware training with full access to the original training pairs to +learn the appropriate bit allocation policies, which limits its ubiquitous +usage. To this end, we introduce the first on-the-fly adaptive quantization +framework that accelerates the processing time from hours to seconds. We +formulate the bit allocation problem with only two bit mapping modules: one to +map the input image to the image-wise bit adaptation factor and one to obtain +the layer-wise adaptation factors. These bit mappings are calibrated and +fine-tuned using only a small number of calibration images. We achieve +competitive performance with the previous adaptive quantization methods, while +the processing time is accelerated by x2000. 
Codes are available at +https://github.com/Cheeun/AdaBM.",cs.CV,"['cs.CV', 'eess.IV']" +Residual Denoising Diffusion Models,Jiawei Liu · Qiang Wang · Huijie Fan · Yinong Wang · Yandong Tang · Liangqiong Qu,https://github.com/nachifur/RDDM,https://arxiv.org/abs/2308.13712,,,Residual Denoising Diffusion Models,"We propose residual denoising diffusion models (RDDM), a novel dual diffusion +process that decouples the traditional single denoising diffusion process into +residual diffusion and noise diffusion. This dual diffusion framework expands +the denoising-based diffusion models, initially uninterpretable for image +restoration, into a unified and interpretable model for both image generation +and restoration by introducing residuals. Specifically, our residual diffusion +represents directional diffusion from the target image to the degraded input +image and explicitly guides the reverse generation process for image +restoration, while noise diffusion represents random perturbations in the +diffusion process. The residual prioritizes certainty, while the noise +emphasizes diversity, enabling RDDM to effectively unify tasks with varying +certainty or diversity requirements, such as image generation and restoration. +We demonstrate that our sampling process is consistent with that of DDPM and +DDIM through coefficient transformation, and propose a partially +path-independent generation process to better understand the reverse process. +Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a +batch size of 1, to compete with state-of-the-art image restoration methods. We +provide code and pre-trained models to encourage further exploration, +application, and development of our innovative framework +(https://github.com/nachifur/RDDM).",cs.CV,"['cs.CV', 'cs.LG']" +Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse,Yining Wang · Junjie Sun · Chenyue Wang · Mi Zhang · Min Yang, ,https://arxiv.org/abs/2405.05587,,2405.05587.pdf,Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse,"Recent studies have noted an intriguing phenomenon termed Neural Collapse, +that is, when the neural networks establish the right correlation between +feature spaces and the training targets, their last-layer features, together +with the classifier weights, will collapse into a stable and symmetric +structure. In this paper, we extend the investigation of Neural Collapse to the +biased datasets with imbalanced attributes. We observe that models will easily +fall into the pitfall of shortcut learning and form a biased, non-collapsed +feature space at the early period of training, which is hard to reverse and +limits the generalization capability. To tackle the root cause of biased +classification, we follow the recent inspiration of prime training, and propose +an avoid-shortcut learning framework without additional training complexity. +With well-designed shortcut primes based on Neural Collapse structure, the +models are encouraged to skip the pursuit of simple shortcuts and naturally +capture the intrinsic correlations. 
Experimental results demonstrate that our +method induces better convergence properties during training, and achieves +state-of-the-art generalization performance on both synthetic and real-world +biased datasets.",cs.CV,"['cs.CV', 'cs.LG']" +TIGER: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process,Zhiyuan Ren · Minchul Kim · Feng Liu · Xiaoming Liu, ,,https://link.springer.com/article/10.1007/s00371-024-03370-x,,,,,nan +IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation,Yizhi Song · Zhifei Zhang · Zhe Lin · Scott Cohen · Brian Price · Jianming Zhang · Soo Ye Kim · He Zhang · Wei Xiong · Daniel Aliaga,https://song630.github.io/IMPRINT-Project-Page/,https://arxiv.org/abs/2403.10701,,2403.10701.pdf,IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation,"Generative object compositing emerges as a promising new avenue for +compositional image editing. However, the requirement of object identity +preservation poses a significant challenge, limiting practical usage of most +existing methods. In response, this paper introduces IMPRINT, a novel +diffusion-based generative model trained with a two-stage learning framework +that decouples learning of identity preservation from that of compositing. The +first stage is targeted for context-agnostic, identity-preserving pretraining +of the object encoder, enabling the encoder to learn an embedding that is both +view-invariant and conducive to enhanced detail preservation. The subsequent +stage leverages this representation to learn seamless harmonization of the +object composited to the background. In addition, IMPRINT incorporates a +shape-guidance mechanism offering user-directed control over the compositing +process. Extensive experiments demonstrate that IMPRINT significantly +outperforms existing methods and various baselines on identity preservation and +composition quality.",cs.CV,['cs.CV'] +Differentiable Micro-Mesh Construction,Yishun Dou · Zhong Zheng · Qiaoqiao Jin · Rui Shi · Yuhan Li · Bingbing Ni, ,http://export.arxiv.org/abs/2310.08332v1,,2310.08332v1.pdf,Real-Time Neural BRDF with Spherically Distributed Primitives,"We propose a novel compact and efficient neural BRDF offering highly +versatile material representation, yet with very-light memory and neural +computation consumption towards achieving real-time rendering. The results in +Figure 1, rendered at full HD resolution on a current desktop machine, show +that our system achieves real-time rendering with a wide variety of +appearances, which is approached by the following two designs. On the one hand, +noting that bidirectional reflectance is distributed in a very sparse +high-dimensional subspace, we propose to project the BRDF into two +low-dimensional components, i.e., two hemisphere feature-grids for incoming and +outgoing directions, respectively. On the other hand, learnable neural +reflectance primitives are distributed on our highly-tailored spherical surface +grid, which offer informative features for each component and alleviate the +conventional heavy feature learning network to a much smaller one, leading to +very fast evaluation. These primitives are centrally stored in a codebook and +can be shared across multiple grids and even across materials, based on the +low-cost indices stored in material-specific spherical surface grids. 
Our +neural BRDF, which is agnostic to the material, provides a unified framework +that can represent a variety of materials in consistent manner. Comprehensive +experimental results on measured BRDF compression, Monte Carlo simulated BRDF +acceleration, and extension to spatially varying effect demonstrate the +superior quality and generalizability achieved by the proposed scheme.",cs.CV,['cs.CV'] +FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization,Jiahui Zhang · Fangneng Zhan · MUYU XU · Shijian Lu · Eric P. Xing, ,https://arxiv.org/abs/2403.06908v1,,2403.06908v1.pdf,FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization,"3D Gaussian splatting has achieved very impressive performance in real-time +novel view synthesis. However, it often suffers from over-reconstruction during +Gaussian densification where high-variance image regions are covered by a few +large Gaussians only, leading to blur and artifacts in the rendered images. We +design a progressive frequency regularization (FreGS) technique to tackle the +over-reconstruction issue within the frequency space. Specifically, FreGS +performs coarse-to-fine Gaussian densification by exploiting low-to-high +frequency components that can be easily extracted with low-pass and high-pass +filters in the Fourier space. By minimizing the discrepancy between the +frequency spectrum of the rendered image and the corresponding ground truth, it +achieves high-quality Gaussian densification and alleviates the +over-reconstruction of Gaussian splatting effectively. Experiments over +multiple widely adopted benchmarks (e.g., Mip-NeRF360, Tanks-and-Temples and +Deep Blending) show that FreGS achieves superior novel view synthesis and +outperforms the state-of-the-art consistently.",cs.CV,['cs.CV'] +Parameter Efficient Self-Supervised Geospatial Domain Adaptation,Linus Scheibenreif · Michael Mommert · Damian Borth, ,https://arxiv.org/abs/2312.13066,,2312.13066.pdf,PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation,"Self-supervised monocular depth estimation is of significant importance with +applications spanning across autonomous driving and robotics. However, the +reliance on self-supervision introduces a strong static-scene assumption, +thereby posing challenges in achieving optimal performance in dynamic scenes, +which are prevalent in most real-world situations. To address these issues, we +propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to +transfer a pre-trained image model for self-supervised depth estimation. The +training comprises two sequential stages: an initial phase trained on a dataset +primarily composed of static scenes, succeeded by an expansion to more +intricate datasets involving dynamic scenes. To facilitate this process, we +design compact encoder and decoder adapters to enable parameter-efficient +tuning, allowing the network to adapt effectively. They not only uphold +generalized patterns from pre-trained image models but also retain knowledge +gained from the preceding phase into the subsequent one. 
Extensive experiments +demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI, +CityScapes and DDAD datasets.",cs.CV,['cs.CV'] +Question Aware Vision Transformer for Multimodal Reasoning,Roy Ganz · Yair Kittenplon · Aviad Aberdam · Elad Ben Avraham · Oren Nuriel · Shai Mazor · Ron Litman, ,https://arxiv.org/abs/2402.05472,,2402.05472.pdf,Question Aware Vision Transformer for Multimodal Reasoning,"Vision-Language (VL) models have gained significant research focus, enabling +remarkable advances in multimodal reasoning. These architectures typically +comprise a vision encoder, a Large Language Model (LLM), and a projection +module that aligns visual features with the LLM's representation space. Despite +their success, a critical limitation persists: the vision encoding process +remains decoupled from user queries, often in the form of image-related +questions. Consequently, the resulting visual features may not be optimally +attuned to the query-specific elements of the image. To address this, we +introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal +reasoning, which embeds question awareness directly within the vision encoder. +This integration results in dynamic visual features focusing on relevant image +aspects to the posed question. QA-ViT is model-agnostic and can be incorporated +efficiently into any VL architecture. Extensive experiments demonstrate the +effectiveness of applying our method to various multimodal architectures, +leading to consistent improvement across diverse tasks and showcasing its +potential for enhancing visual and scene-text understanding.",cs.CV,['cs.CV'] +Real-Time Neural BRDF with Spherically Distributed Primitives,Yishun Dou · Zhong Zheng · Qiaoqiao Jin · Bingbing Ni · Yugang Chen · Junxiang Ke, ,https://arxiv.org/abs/2310.08332,,2310.08332.pdf,Real-Time Neural BRDF with Spherically Distributed Primitives,"We propose a novel compact and efficient neural BRDF offering highly +versatile material representation, yet with very-light memory and neural +computation consumption towards achieving real-time rendering. The results in +Figure 1, rendered at full HD resolution on a current desktop machine, show +that our system achieves real-time rendering with a wide variety of +appearances, which is approached by the following two designs. On the one hand, +noting that bidirectional reflectance is distributed in a very sparse +high-dimensional subspace, we propose to project the BRDF into two +low-dimensional components, i.e., two hemisphere feature-grids for incoming and +outgoing directions, respectively. On the other hand, learnable neural +reflectance primitives are distributed on our highly-tailored spherical surface +grid, which offer informative features for each component and alleviate the +conventional heavy feature learning network to a much smaller one, leading to +very fast evaluation. These primitives are centrally stored in a codebook and +can be shared across multiple grids and even across materials, based on the +low-cost indices stored in material-specific spherical surface grids. Our +neural BRDF, which is agnostic to the material, provides a unified framework +that can represent a variety of materials in consistent manner. 
Comprehensive +experimental results on measured BRDF compression, Monte Carlo simulated BRDF +acceleration, and extension to spatially varying effect demonstrate the +superior quality and generalizability achieved by the proposed scheme.",cs.CV,['cs.CV'] +Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning,HAO ZHANG · Linfeng Tang · Xinyu Xiang · Xuhui Zuo · Jiayi Ma, ,,https://github.com/HaoZhang1018/DDBF,,,,,nan +HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention,Xiaolong Tang · Meina Kan · Shiguang Shan · Zhilong Ji · Jinfeng Bai · Xilin Chen, ,https://arxiv.org/abs/2404.06351,,2404.06351.pdf,HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention,"Predicting the trajectories of road agents is essential for autonomous +driving systems. The recent mainstream methods follow a static paradigm, which +predicts the future trajectory by using a fixed duration of historical frames. +These methods make the predictions independently even at adjacent time steps, +which leads to potential instability and temporal inconsistency. As successive +time steps have largely overlapping historical frames, their forecasting should +have intrinsic correlation, such as overlapping predicted trajectories should +be consistent, or be different but share the same motion goal depending on the +road situation. Motivated by this, in this work, we introduce HPNet, a novel +dynamic trajectory forecasting method. Aiming for stable and accurate +trajectory forecasting, our method leverages not only historical frames +including maps and agent states, but also historical predictions. Specifically, +we newly design a Historical Prediction Attention module to automatically +encode the dynamic relationship between successive predictions. Besides, it +also extends the attention range beyond the currently visible window +benefitting from the use of historical predictions. The proposed Historical +Prediction Attention together with the Agent Attention and Mode Attention is +further formulated as the Triple Factorized Attention module, serving as the +core design of HPNet.Experiments on the Argoverse and INTERACTION datasets show +that HPNet achieves state-of-the-art performance, and generates accurate and +stable future trajectories. Our code are available at +https://github.com/XiaolongTang23/HPNet.",cs.CV,['cs.CV'] +Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection,Xiaowei Zhao · Xianglong Liu · Duorui Wang · Yajun Gao · Zhide Liu, ,https://arxiv.org/abs/2306.05493,,,Multi-Modal Classifiers for Open-Vocabulary Object Detection,"The goal of this paper is open-vocabulary object detection (OVOD) +$\unicode{x2013}$ building a model that can detect objects beyond the set of +categories seen at training, thus enabling the user to specify categories of +interest at inference without the need for model retraining. We adopt a +standard two-stage object detector architecture, and explore three ways for +specifying novel categories: via language descriptions, via image exemplars, or +via a combination of the two. 
We make three contributions: first, we prompt a +large language model (LLM) to generate informative language descriptions for +object classes, and construct powerful text-based classifiers; second, we +employ a visual aggregator on image exemplars that can ingest any number of +images as input, forming vision-based classifiers; and third, we provide a +simple method to fuse information from language descriptions and image +exemplars, yielding a multi-modal classifier. When evaluating on the +challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our +text-based classifiers outperform all previous OVOD works; (ii) our +vision-based classifiers perform as well as text-based classifiers in prior +work; (iii) using multi-modal classifiers perform better than either modality +alone; and finally, (iv) our text-based and multi-modal classifiers yield +better performance than a fully-supervised detector.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.6; I.4.8; I.4.9; I.2.10']" +CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation,Townim Chowdhury · Kewen Liao · Vu Minh Hieu Phan · Minh-Son To · Yutong Xie · Kevin Hung · David Ross · Anton van den Hengel · Johan Verjans · Zhibin Liao, ,https://arxiv.org/abs/2404.02388,,2404.02388.pdf,CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation,"Deep Neural Networks (DNNs) are widely used for visual classification tasks, +but their complex computation process and black-box nature hinder decision +transparency and interpretability. Class activation maps (CAMs) and recent +variants provide ways to visually explain the DNN decision-making process by +displaying 'attention' heatmaps of the DNNs. Nevertheless, the CAM explanation +only offers relative attention information, that is, on an attention heatmap, +we can interpret which image region is more or less important than the others. +However, these regions cannot be meaningfully compared across classes, and the +contribution of each region to the model's class prediction is not revealed. To +address these challenges that ultimately lead to better DNN Interpretation, in +this paper, we propose CAPE, a novel reformulation of CAM that provides a +unified and probabilistically meaningful assessment of the contributions of +image regions. We quantitatively and qualitatively compare CAPE with +state-of-the-art CAM methods on CUB and ImageNet benchmark datasets to +demonstrate enhanced interpretability. We also test on a cytology imaging +dataset depicting a challenging Chronic Myelomonocytic Leukemia (CMML) +diagnosis problem. Code is available at: https://github.com/AIML-MED/CAPE.",cs.CV,['cs.CV'] +Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training,Qian Li · Yuxiao Hu · Yinpeng Dong · Dongxiao Zhang · Yuntian Chen, ,https://arxiv.org/abs/2312.07067,,2312.07067.pdf,Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training,"Adversarial training is often formulated as a min-max problem, however, +concentrating only on the worst adversarial examples causes alternating +repetitive confusion of the model, i.e., previously defended or correctly +classified samples are not defensible or accurately classifiable in subsequent +adversarial training. We characterize such non-ignorable samples as ""hiders"", +which reveal the hidden high-risk regions within the secure area obtained +through adversarial training and prevent the model from finding the real worst +cases. 
We demand the model to prevent hiders when defending against adversarial +examples for improving accuracy and robustness simultaneously. By rethinking +and redefining the min-max optimization problem for adversarial training, we +propose a generalized adversarial training algorithm called Hider-Focused +Adversarial Training (HFAT). HFAT introduces the iterative evolution +optimization strategy to simplify the optimization problem and employs an +auxiliary model to reveal hiders, effectively combining the optimization +directions of standard adversarial training and prevention hiders. Furthermore, +we introduce an adaptive weighting mechanism that facilitates the model in +adaptively adjusting its focus between adversarial examples and hiders during +different training periods. We demonstrate the effectiveness of our method +based on extensive experiments, and ensure that HFAT can provide higher +robustness and accuracy.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV', 'stat.AP']" +Multi-Space Alignments Towards Universal LiDAR Segmentation,Youquan Liu · Lingdong Kong · Xiaoyang Wu · Runnan Chen · Xin Li · Liang Pan · Ziwei Liu · Yuexin Ma, ,https://arxiv.org/abs/2405.01538,,2405.01538.pdf,Multi-Space Alignments Towards Universal LiDAR Segmentation,"A unified and versatile LiDAR segmentation model with strong robustness and +generalizability is desirable for safe autonomous driving perception. This work +presents M3Net, a one-of-a-kind framework for fulfilling multi-task, +multi-dataset, multi-modality LiDAR segmentation in a universal manner using +just a single set of parameters. To better exploit data volume and diversity, +we first combine large-scale driving datasets acquired by different types of +sensors from diverse scenes and then conduct alignments in three spaces, namely +data, feature, and label spaces, during the training. As a result, M3Net is +capable of taming heterogeneous data for training state-of-the-art LiDAR +segmentation models. Extensive experiments on twelve LiDAR segmentation +datasets verify our effectiveness. Notably, using a shared set of parameters, +M3Net achieves 75.1%, 83.1%, and 72.4% mIoU scores, respectively, on the +official benchmarks of SemanticKITTI, nuScenes, and Waymo Open.",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Fast ODE-based Sampling for Diffusion Models in Around 5 Steps,Zhenyu Zhou · Defang Chen · Can Wang · Chun Chen, ,https://arxiv.org/abs/2312.00094,,2312.00094.pdf,Fast ODE-based Sampling for Diffusion Models in Around 5 Steps,"Sampling from diffusion models can be treated as solving the corresponding +ordinary differential equations (ODEs), with the aim of obtaining an accurate +solution with as few number of function evaluations (NFE) as possible. +Recently, various fast samplers utilizing higher-order ODE solvers have emerged +and achieved better performance than the initial first-order one. However, +these numerical methods inherently result in certain approximation errors, +which significantly degrades sample quality with extremely small NFE (e.g., +around 5). In contrast, based on the geometric observation that each sampling +trajectory almost lies in a two-dimensional subspace embedded in the ambient +space, we propose Approximate MEan-Direction Solver (AMED-Solver) that +eliminates truncation errors by directly learning the mean direction for fast +diffusion sampling. Besides, our method can be easily used as a plugin to +further improve existing ODE-based samplers. 
Extensive experiments on image +synthesis with the resolution ranging from 32 to 512 demonstrate the +effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10, +10.74 FID on ImageNet 64$\times$64, and 13.20 FID on LSUN Bedroom. Our code is +available at https://github.com/zju-pi/diff-sampler.",cs.CV,"['cs.CV', 'cs.AI']" +OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies,Lingdong Kong · Youquan Liu · Lai Xing Ng · Benoit Cottereau · Wei Tsang Ooi,https://github.com/ldkong1205/OpenESS,http://export.arxiv.org/abs/2405.05259,,2405.05259.pdf,OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies,"Event-based semantic segmentation (ESS) is a fundamental yet challenging task +for event camera sensing. The difficulties in interpreting and annotating event +data limit its scalability. While domain adaptation from images to event data +can help to mitigate this issue, there exist data representational differences +that require additional effort to resolve. In this work, for the first time, we +synergize information from image, text, and event-data domains and introduce +OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. +We achieve this goal by transferring the semantically rich CLIP knowledge from +image-text pairs to event streams. To pursue better cross-modality adaptation, +we propose a frame-to-event contrastive distillation and a text-to-event +semantic consistency regularization. Experimental results on popular ESS +benchmarks showed our approach outperforms existing methods. Notably, we +achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either +event or frame labels.",cs.CV,"['cs.CV', 'cs.RO']" +Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior,Zike Wu · Pan Zhou · YI Xuanyu · Xiaoding Yuan · Hanwang Zhang, ,,https://paperswithcode.com/paper/consistent3d-towards-consistent-high-fidelity,,,,,nan +VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources,Fan Fei · Jiajun Tang · Ping Tan · Boxin Shi,https://costrice.github.io/vminer/,https://arxiv.org/abs/2402.06136,,2402.06136.pdf,SIR: Multi-view Inverse Rendering with Decomposable Shadow for Indoor Scenes,"We propose SIR, an efficient method to decompose differentiable shadows for +inverse rendering on indoor scenes using multi-view data, addressing the +challenges in accurately decomposing the materials and lighting conditions. +Unlike previous methods that struggle with shadow fidelity in complex lighting +environments, our approach explicitly learns shadows for enhanced realism in +material estimation under unknown light positions. Utilizing posed HDR images +as input, SIR employs an SDF-based neural radiance field for comprehensive +scene representation. Then, SIR integrates a shadow term with a three-stage +material estimation approach to improve SVBRDF quality. Specifically, SIR is +designed to learn a differentiable shadow, complemented by BRDF regularization, +to optimize inverse rendering accuracy. Extensive experiments on both synthetic +and real-world indoor scenes demonstrate the superior performance of SIR over +existing methods in both quantitative metrics and qualitative analysis. The +significant decomposing ability of SIR enables sophisticated editing +capabilities like free-view relighting, object insertion, and material +replacement. 
The code and data are available at +https://xiaokangwei.github.io/SIR/.",cs.CV,['cs.CV'] +Weak-to-Strong 3D Object Detection with X-Ray Distillation,Alexander Gambashidze · Aleksandr Dadukin · Maksim Golyadkin · Maria Razzhivina · Ilya Makarov, ,https://arxiv.org/abs/2404.00679,,2404.00679.pdf,Weak-to-Strong 3D Object Detection with X-Ray Distillation,"This paper addresses the critical challenges of sparsity and occlusion in +LiDAR-based 3D object detection. Current methods often rely on supplementary +modules or specific architectural designs, potentially limiting their +applicability to new and evolving architectures. To our knowledge, we are the +first to propose a versatile technique that seamlessly integrates into any +existing framework for 3D Object Detection, marking the first instance of +Weak-to-Strong generalization in 3D computer vision. We introduce a novel +framework, X-Ray Distillation with Object-Complete Frames, suitable for both +supervised and semi-supervised settings, that leverages the temporal aspect of +point cloud sequences. This method extracts crucial information from both +previous and subsequent LiDAR frames, creating Object-Complete frames that +represent objects from multiple viewpoints, thus addressing occlusion and +sparsity. Given the limitation of not being able to generate Object-Complete +frames during online inference, we utilize Knowledge Distillation within a +Teacher-Student framework. This technique encourages the strong Student model +to emulate the behavior of the weaker Teacher, which processes simple and +informative Object-Complete frames, effectively offering a comprehensive view +of objects as if seen through X-ray vision. Our proposed methods surpass +state-of-the-art in semi-supervised learning by 1-1.5 mAP and enhance the +performance of five established supervised models by 1-2 mAP on standard +autonomous driving datasets, even with default hyperparameters. Code for +Object-Complete frames is available here: +https://github.com/sakharok13/X-Ray-Teacher-Patching-Tools.",cs.CV,['cs.CV'] +AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents,Jieming Cui · Tengyu Liu · Nian Liu · Yaodong Yang · Yixin Zhu · Siyuan Huang, ,https://arxiv.org/abs/2403.12835,,2403.12835.pdf,AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents,"Traditional approaches in physics-based motion generation, centered around +imitation learning and reward shaping, often struggle to adapt to new +scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical +method that learns physically plausible interactions following open-vocabulary +instructions. Our approach begins by developing a set of atomic actions via a +low-level controller trained via imitation learning. Upon receiving an +open-vocabulary textual instruction, AnySkill employs a high-level policy that +selects and integrates these atomic actions to maximize the CLIP similarity +between the agent's rendered images and the text. An important feature of our +method is the use of image-based rewards for the high-level policy, which +allows the agent to learn interactions with objects without manual reward +engineering. 
We demonstrate AnySkill's capability to generate realistic and +natural motion sequences in response to unseen instructions of varying lengths, +marking it the first method capable of open-vocabulary physical skill learning +for interactive humanoid agents.",cs.CV,"['cs.CV', 'cs.RO']" +Learning Continuous 3D Words for Text-to-Image Generation,Ta-Ying Cheng · Matheus Gadelha · Thibault Groueix · Matthew Fisher · Radomir Mech · Andrew Markham · Niki Trigoni,https://ttchengab.github.io/continuous_3d_words/,https://arxiv.org/abs/2402.08654,,2402.08654.pdf,Learning Continuous 3D Words for Text-to-Image Generation,"Current controls over diffusion models (e.g., through text or ControlNet) for +image generation fall short in recognizing abstract, continuous attributes like +illumination direction or non-rigid shape change. In this paper, we present an +approach for allowing users of text-to-image models to have fine-grained +control of several attributes in an image. We do this by engineering special +sets of input tokens that can be transformed in a continuous manner -- we call +them Continuous 3D Words. These attributes can, for example, be represented as +sliders and applied jointly with text prompts for fine-grained control over +image generation. Given only a single mesh and a rendering engine, we show that +our approach can be adopted to provide continuous user control over several +3D-aware attributes, including time-of-day illumination, bird wing orientation, +dollyzoom effect, and object poses. Our method is capable of conditioning image +creation with multiple Continuous 3D Words and text descriptions simultaneously +while adding no overhead to the generative process. Project Page: +https://ttchengab.github.io/continuous_3d_words",cs.CV,['cs.CV'] +Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model,Xu He · Qiaochu Huang · Zhensong Zhang · Zhiwei Lin · Zhiyong Wu · Sicheng Yang · Minglei Li · Zhiyi Chen · Songcen Xu · Xiaofei Wu, ,https://arxiv.org/abs/2404.01862v1,,2404.01862v1.pdf,Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model,"Co-speech gestures, if presented in the lively form of videos, can achieve +superior visual effects in human-machine interaction. While previous works +mostly generate structural human skeletons, resulting in the omission of +appearance information, we focus on the direct generation of audio-driven +co-speech gesture videos in this work. There are two main challenges: 1) A +suitable motion feature is needed to describe complex human movements with +crucial appearance information. 2) Gestures and speech exhibit inherent +dependencies and should be temporally aligned even of arbitrary length. To +solve these problems, we present a novel motion-decoupled framework to generate +co-speech gesture videos. Specifically, we first introduce a well-designed +nonlinear TPS transformation to obtain latent motion features preserving +essential appearance information. Then a transformer-based diffusion model is +proposed to learn the temporal correlation between gestures and speech, and +performs generation in the latent motion space, followed by an optimal motion +selection module to produce long-term coherent and consistent gesture videos. +For better visual perception, we further design a refinement network focusing +on missing details of certain areas. Extensive experimental results show that +our proposed framework significantly outperforms existing approaches in both +motion and video-related evaluations. 
Our code, demos, and more resources are +available at https://github.com/thuhcsi/S2G-MDDiffusion.",cs.CV,"['cs.CV', 'cs.HC', 'cs.MM']" +Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion,Lalit Manam · Venu Madhav Govindu,https://ee.iisc.ac.in/cvlab/research/camtripsfm/,,,,,,,nan +TextCraftor: Your Text Encoder Can be Image Quality Controller,Yanyu Li · Xian Liu · Anil Kag · Ju Hu · Yerlan Idelbayev · Dhritiman Sagar · Yanzhi Wang · Sergey Tulyakov · Jian Ren, ,https://arxiv.org/abs/2403.18978,,2403.18978.pdf,TextCraftor: Your Text Encoder Can be Image Quality Controller,"Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have +revolutionized the field of content generation, enabling significant +advancements in areas like image editing and video synthesis. Despite their +formidable capabilities, these models are not without their limitations. It is +still challenging to synthesize an image that aligns well with the input text, +and multiple runs with carefully crafted prompts are required to achieve +satisfactory results. To mitigate these limitations, numerous studies have +endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing +various technologies. Yet, amidst these efforts, a pivotal question of +text-to-image diffusion model training has remained largely unexplored: Is it +possible and feasible to fine-tune the text encoder to improve the performance +of text-to-image diffusion models? Our findings reveal that, instead of +replacing the CLIP text encoder used in Stable Diffusion with other large +language models, we can enhance it through our proposed fine-tuning approach, +TextCraftor, leading to substantial improvements in quantitative benchmarks and +human assessments. Interestingly, our technique also empowers controllable +image generation through the interpolation of different text encoders +fine-tuned with various rewards. We also demonstrate that TextCraftor is +orthogonal to UNet finetuning, and can be combined to further improve +generative quality.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,Jiahao Nie · Yun Xing · Gongjie Zhang · Pei Yan · Aoran Xiao · Yap-peng Tan · Alex C. Kot · Shijian Lu, ,https://arxiv.org/abs/2401.08407,,,Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,"Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting +novel categories from a distinct domain using only limited exemplars. In this +paper, we undertake a comprehensive study of CD-FSS and uncover two crucial +insights: (i) the necessity of a fine-tuning stage to effectively transfer the +learned meta-knowledge across domains, and (ii) the overfitting risk during the +na\""ive fine-tuning due to the scarcity of novel category examples. With these +insights, we propose a novel cross-domain fine-tuning strategy that addresses +the challenging CD-FSS tasks. We first design Bi-directional Few-shot +Prediction (BFP), which establishes support-query correspondence in a +bi-directional manner, crafting augmented supervision to reduce the overfitting +risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which +is a recursive framework to capture the support-query correspondence +iteratively, targeting maximal exploitation of supervisory signals from the +sparse novel category samples. 
Extensive empirical evaluations show that our +method significantly outperforms the state-of-the-arts (+7.8\%), which verifies +that IFA tackles the cross-domain challenges and mitigates the overfitting +simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.",cs.CV,['cs.CV'] +Learning Large-Factor EM Image Super-Resolution with Generative Priors,Jiateng Shou · Zeyu Xiao · Shiyu Deng · Wei Huang · ShiPeiyao · Ruobing Zhang · Zhiwei Xiong · Feng Wu,https://github.com/jtshou/GPEMSR,https://arxiv.org/html/2405.07044v1,,2405.07044v1.pdf,Semantic Guided Large Scale Factor Remote Sensing Image Super-resolution with Generative Diffusion Prior,"Remote sensing images captured by different platforms exhibit significant +disparities in spatial resolution. Large scale factor super-resolution (SR) +algorithms are vital for maximizing the utilization of low-resolution (LR) +satellite data captured from orbit. However, existing methods confront +challenges in recovering SR images with clear textures and correct ground +objects. We introduce a novel framework, the Semantic Guided Diffusion Model +(SGDM), designed for large scale factor remote sensing image super-resolution. +The framework exploits a pre-trained generative model as a prior to generate +perceptually plausible SR images. We further enhance the reconstruction by +incorporating vector maps, which carry structural and semantic cues. Moreover, +pixel-level inconsistencies in paired remote sensing images, stemming from +sensor-specific imaging characteristics, may hinder the convergence of the +model and diversity in generated results. To address this problem, we propose +to extract the sensor-specific imaging characteristics and model the +distribution of them, allowing diverse SR images generation based on imaging +characteristics provided by reference images or sampled from the imaging +characteristic probability distributions. To validate and evaluate our +approach, we create the Cross-Modal Super-Resolution Dataset (CMSRD). +Qualitative and quantitative experiments on CMSRD showcase the superiority and +broad applicability of our method. Experimental results on downstream vision +tasks also demonstrate the utilitarian of the generated SR images. The dataset +and code will be publicly available at https://github.com/wwangcece/SGDM",cs.CV,['cs.CV'] +FairCLIP: Harnessing Fairness in Vision-Language Learning,Yan Luo · MIN SHI · Muhammad Osama Khan · Muhammad Muneeb Afzal · Hao Huang · Shuaihang Yuan · Yu Tian · Luo Song · Ava Kouhana · Tobias Elze · Yi Fang · Mengyu Wang, ,https://arxiv.org/abs/2403.19949,,2403.19949.pdf,FairCLIP: Harnessing Fairness in Vision-Language Learning,"Fairness is a critical concern in deep learning, especially in healthcare, +where these models influence diagnoses and treatment decisions. Although +fairness has been investigated in the vision-only domain, the fairness of +medical vision-language (VL) models remains unexplored due to the scarcity of +medical VL datasets for studying fairness. To bridge this research gap, we +introduce the first fair vision-language medical dataset Harvard-FairVLMed that +provides detailed demographic attributes, ground-truth labels, and clinical +notes to facilitate an in-depth examination of fairness within VL foundation +models. Using Harvard-FairVLMed, we conduct a comprehensive fairness analysis +of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and +medical domains, across four different protected attributes. 
Our results +highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, +and Spanish being the preferred subgroups across the protected attributes of +race, gender, ethnicity, and language, respectively. In order to alleviate +these biases, we propose FairCLIP, an optimal-transport-based approach that +achieves a favorable trade-off between performance and fairness by reducing the +Sinkhorn distance between the overall sample distribution and the distributions +corresponding to each demographic group. As the first VL dataset of its kind, +Harvard-FairVLMed holds the potential to catalyze advancements in the +development of machine learning models that are both ethically aware and +clinically effective. Our dataset and code are available at +https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k.",cs.CV,['cs.CV'] +Distributionally Generative Augmentation for Fair Facial Attribute Classification,Fengda Zhang · Qianpei He · Kun Kuang · Jiashuo Liu · Long Chen · Chao Wu · Jun Xiao · Hanwang Zhang,https://github.com/heqianpei/DiGA,https://arxiv.org/abs/2403.06606,,2403.06606.pdf,Distributionally Generative Augmentation for Fair Facial Attribute Classification,"Facial Attribute Classification (FAC) holds substantial promise in widespread +applications. However, FAC models trained by traditional methodologies can be +unfair by exhibiting accuracy inconsistencies across varied data +subpopulations. This unfairness is largely attributed to bias in data, where +some spurious attributes (e.g., Male) statistically correlate with the target +attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the +labels of spurious attributes, which may be unavailable in practice. This work +proposes a novel, generation-based two-stage framework to train a fair FAC +model on biased data without additional annotation. Initially, we identify the +potential spurious attributes based on generative models. Notably, it enhances +interpretability by explicitly showing the spurious attributes in image space. +Following this, for each image, we first edit the spurious attributes with a +random degree sampled from a uniform distribution, while keeping target +attribute unchanged. Then we train a fair FAC model by fostering model +invariance to these augmentation. Extensive experiments on three common +datasets demonstrate the effectiveness of our method in promoting fairness in +FAC without compromising accuracy. Codes are in +https://github.com/heqianpei/DiGA.",cs.CV,"['cs.CV', 'cs.LG']" +RobustSAM: Segment Anything Robustly on Degraded Images,Wei-Ting Chen · Yu Jiet Vong · Sy-Yen Kuo · Sizhuo Ma · Jian Wang, ,https://arxiv.org/abs/2306.07713,,2306.07713.pdf,Robustness of SAM: Segment Anything Under Corruptions and Beyond,"Segment anything model (SAM), as the name suggests, is claimed to be capable +of cutting out any object and demonstrates impressive zero-shot transfer +performance with the guidance of prompts. However, there is currently a lack of +comprehensive evaluation regarding its robustness under various corruptions. +Understanding the robustness of SAM across different corruption scenarios is +crucial for its real-world deployment. Prior works show that SAM is biased +towards texture (style) rather than shape, motivated by which we start by +investigating its robustness against style transfer, which is synthetic +corruption. 
Following by interpreting the effects of synthetic corruption as +style changes, we proceed to conduct a comprehensive evaluation for its +robustness against 15 types of common corruption. These corruptions mainly fall +into categories such as digital, noise, weather, and blur, and within each +corruption category, we explore 5 severity levels to simulate real-world +corruption scenarios. Beyond the corruptions, we further assess the robustness +of SAM against local occlusion and local adversarial patch attacks. To the best +of our knowledge, our work is the first of its kind to evaluate the robustness +of SAM under style change, local occlusion, and local adversarial patch +attacks. Given that patch attacks visible to human eyes are easily detectable, +we further assess its robustness against global adversarial attacks that are +imperceptible to human eyes. Overall, this work provides a comprehensive +empirical study of the robustness of SAM, evaluating its performance under +various corruptions and extending the assessment to critical aspects such as +local occlusion, local adversarial patch attacks, and global adversarial +attacks. These evaluations yield valuable insights into the practical +applicability and effectiveness of SAM in addressing real-world challenges.",cs.CV,['cs.CV'] +ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation,Dar-Yen Chen · Hamish Tennent · Ching-Wen Hsu,https://cardinalblue.github.io/artadapter.github.io/,https://arxiv.org/abs/2312.02109v1,,2312.02109v1.pdf,ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation,"This work introduces ArtAdapter, a transformative text-to-image (T2I) style +transfer framework that transcends traditional limitations of color, +brushstrokes, and object shape, capturing high-level style elements such as +composition and distinctive artistic expression. The integration of a +multi-level style encoder with our proposed explicit adaptation mechanism +enables ArtAdapte to achieve unprecedented fidelity in style transfer, ensuring +close alignment with textual descriptions. Additionally, the incorporation of +an Auxiliary Content Adapter (ACA) effectively separates content from style, +alleviating the borrowing of content from style references. Moreover, our novel +fast finetuning approach could further enhance zero-shot style representation +while mitigating the risk of overfitting. Comprehensive evaluations confirm +that ArtAdapter surpasses current state-of-the-art methods.",cs.CV,['cs.CV'] +NAPGuard: Towards Detecting Naturalistic Adversarial Patches,Siyang Wu · Jiakai Wang · Jiejie Zhao · Yazhe Wang · Xianglong Liu,https://github.com/wsynuiag/NAPGaurd,https://arxiv.org/abs/2307.08076,,2307.08076.pdf,Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector,"Many physical adversarial patch generation methods are widely proposed to +protect personal privacy from malicious monitoring using object detectors. +However, they usually fail to generate satisfactory patch images in terms of +both stealthiness and attack performance without making huge efforts on careful +hyperparameter tuning. To address this issue, we propose a novel naturalistic +adversarial patch generation method based on the diffusion models (DM). 
Through +sampling the optimal image from the DM model pretrained upon natural images, it +allows us to stably craft high-quality and naturalistic physical adversarial +patches to humans without suffering from serious mode collapse problems as +other deep generative models. To the best of our knowledge, we are the first to +propose DM-based naturalistic adversarial patch generation for object +detectors. With extensive quantitative, qualitative, and subjective +experiments, the results demonstrate the effectiveness of the proposed approach +to generate better-quality and more naturalistic adversarial patches while +achieving acceptable attack performance than other state-of-the-art patch +generation methods. We also show various generation trade-offs under different +conditions.",cs.CV,['cs.CV'] +DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer,Wei-Ting Chen · Gurunandan Krishnan · Qiang Gao · Sy-Yen Kuo · Sizhuo Ma · Jian Wang, ,,https://ieeexplore.ieee.org/abstract/document/10381809/authors,,,,,nan +PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios,Jingbo Wang · Zhengyi Luo · Ye Yuan · Yixuan LI · Bo Dai, ,https://arxiv.org/html/2404.19722v1,,2404.19722v1.pdf,PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios,"We address the challenge of content diversity and controllability in +pedestrian simulation for driving scenarios. Recent pedestrian animation +frameworks have a significant limitation wherein they primarily focus on either +following trajectory [46] or the content of the reference video [57], +consequently overlooking the potential diversity of human motion within such +scenarios. This limitation restricts the ability to generate pedestrian +behaviors that exhibit a wider range of variations and realistic motions and +therefore restricts its usage to provide rich motion content for other +components in the driving simulation system, e.g., suddenly changed motion to +which the autonomous vehicle should respond. In our approach, we strive to +surpass the limitation by showcasing diverse human motions obtained from +various sources, such as generated human motions, in addition to following the +given trajectory. The fundamental contribution of our framework lies in +combining the motion tracking task with trajectory following, which enables the +tracking of specific motion parts (e.g., upper body) while simultaneously +following the given trajectory by a single policy. This way, we significantly +enhance both the diversity of simulated human motion within the given scenario +and the controllability of the content, including language-based control. Our +framework facilitates the generation of a wide range of human motions, +contributing to greater realism and adaptability in pedestrian simulations for +driving scenarios. 
More information is on our project page +https://wangjingbo1219.github.io/papers/CVPR2024_PACER_PLUS/PACERPLUSPage.html .",cs.CV,['cs.CV'] +Cache Me if You Can: Accelerating Diffusion Models through Block Caching,Felix Wimbauer · Bichen Wu · Edgar Schoenfeld · Xiaoliang Dai · Ji Hou · Zijian He · Artsiom Sanakoyeu · Peizhao Zhang · Sam Tsai · Jonas Kohler · Christian Rupprecht · Daniel Cremers · Peter Vajda · Jialiang Wang, ,https://arxiv.org/abs/2312.03209,,2312.03209.pdf,Cache Me if You Can: Accelerating Diffusion Models through Block Caching,"Diffusion models have recently revolutionized the field of image synthesis +due to their ability to generate photorealistic images. However, one of the +major drawbacks of diffusion models is that the image generation process is +costly. A large image-to-image network has to be applied many times to +iteratively refine an image from random noise. While many recent works propose +techniques to reduce the number of required steps, they generally treat the +underlying denoising network as a black box. In this work, we investigate the +behavior of the layers within the network and find that 1) the layers' output +changes smoothly over time, 2) the layers show distinct patterns of change, and +3) the change from step to step is often very small. We hypothesize that many +layer computations in the denoising network are redundant. Leveraging this, we +introduce block caching, in which we reuse outputs from layer blocks of +previous steps to speed up inference. Furthermore, we propose a technique to +automatically determine caching schedules based on each block's changes over +timesteps. In our experiments, we show through FID, human evaluation and +qualitative analysis that Block Caching allows to generate images with higher +visual quality at the same computational cost. We demonstrate this for +different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).",cs.CV,['cs.CV'] +Multi-Modal Hallucination Control by Visual Information Grounding,Alessandro Favero · Luca Zancato · Matthew Trager · Siddharth Choudhary · Pramuditha Perera · Alessandro Achille · Ashwin Swaminathan · Stefano Soatto, ,https://arxiv.org/abs/2403.14003,,2403.14003.pdf,Multi-Modal Hallucination Control by Visual Information Grounding,"Generative Vision-Language Models (VLMs) are prone to generate +plausible-sounding textual answers that, however, are not always grounded in +the input image. We investigate this phenomenon, usually referred to as +""hallucination"" and show that it stems from an excessive reliance on the +language prior. In particular, we show that as more tokens are generated, the +reliance on the visual prompt decreases, and this behavior strongly correlates +with the emergence of hallucinations. To reduce hallucinations, we introduce +Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for +prompt amplification. M3ID amplifies the influence of the reference image over +the language prior, hence favoring the generation of tokens with higher mutual +information with the visual prompt. M3ID can be applied to any pre-trained +autoregressive VLM at inference time without necessitating further training and +with minimal computational overhead. If training is an option, we show that +M3ID can be paired with Direct Preference Optimization (DPO) to improve the +model's reliance on the prompt image without requiring any labels. 
Our +empirical findings show that our algorithms maintain the fluency and linguistic +capabilities of pre-trained VLMs while reducing hallucinations by mitigating +visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and +M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by +25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as +POPE by 21% and 24%.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Diffusion Time-step Curriculum for One Image to 3D Generation,YI Xuanyu · Zike Wu · Qingshan Xu · Pan Zhou · Joo Lim · Hanwang Zhang, ,https://arxiv.org/abs/2404.04562,,2404.04562.pdf,Diffusion Time-step Curriculum for One Image to 3D Generation,"Score distillation sampling~(SDS) has been widely adopted to overcome the +absence of unseen views in reconstructing 3D objects from a \textbf{single} +image. It leverages pre-trained 2D diffusion models as teacher to guide the +reconstruction of student 3D models. Despite their remarkable success, +SDS-based methods often encounter geometric artifacts and texture saturation. +We find out the crux is the overlooked indiscriminate treatment of diffusion +time-steps during optimization: it unreasonably treats the student-teacher +knowledge distillation to be equal at all time-steps and thus entangles +coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion +Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the +teacher and student models collaborating with the time-step curriculum in a +coarse-to-fine manner. Extensive experiments on NeRF4, RealFusion15, GSO and +Level50 benchmark demonstrate that DTC123 can produce multi-view consistent, +high-quality, and diverse 3D assets. Codes and more generation demos will be +released in https://github.com/yxymessi/DTC123.",cs.CV,['cs.CV'] +3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images,Yifang Men · Hanxi Liu · Yuan Yao · Miaomiao Cui · Xuansong Xie · Zhouhui Lian, ,https://arxiv.org/abs/2311.17917,,2311.17917.pdf,AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text,"We study the problem of creating high-fidelity and animatable 3D avatars from +only textual descriptions. Existing text-to-avatar methods are either limited +to static avatars which cannot be animated or struggle to generate animatable +avatars with promising quality and precise pose control. To address these +limitations, we propose AvatarStudio, a coarse-to-fine generative model that +generates explicit textured 3D meshes for animatable human avatars. +Specifically, AvatarStudio begins with a low-resolution NeRF-based +representation for coarse generation, followed by incorporating SMPL-guided +articulation into the explicit mesh representation to support avatar animation +and high resolution rendering. To ensure view consistency and pose +controllability of the resulting avatars, we introduce a 2D diffusion model +conditioned on DensePose for Score Distillation Sampling supervision. By +effectively leveraging the synergy between the articulated mesh representation +and the DensePose-conditional diffusion model, AvatarStudio can create +high-quality avatars from text that are ready for animation, significantly +outperforming previous methods. Moreover, it is competent for many +applications, e.g., multimodal avatar animations and style-guided avatar +creation. 
For more results, please refer to our project page: +http://jeff95.me/projects/avatarstudio.html",cs.GR,"['cs.GR', 'cs.CV']" +Unleashing Network Potentials for Semantic Scene Completion,Fengyun Wang · Qianru Sun · Dong Zhang · Jinhui Tang,https://github.com/fereenwong/AMMNet,https://arxiv.org/abs/2403.07560v2,,2403.07560v2.pdf,Unleashing Network Potentials for Semantic Scene Completion,"Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy +and semantics from a single-view RGB-D image, and recent SSC methods commonly +adopt multi-modal inputs. However, our investigation reveals two limitations: +ineffective feature learning from single modalities and overfitting to limited +datasets. To address these issues, this paper proposes a novel SSC framework - +Adversarial Modality Modulation Network (AMMNet) - with a fresh perspective of +optimizing gradient updates. The proposed AMMNet introduces two core modules: a +cross-modal modulation enabling the interdependence of gradient flows between +modalities, and a customized adversarial training scheme leveraging dynamic +gradient competition. Specifically, the cross-modal modulation adaptively +re-calibrates the features to better excite representation potentials from each +single modality. The adversarial training employs a minimax game of evolving +gradients, with customized guidance to strengthen the generator's perception of +visual fidelity from both geometric completeness and semantic correctness. +Extensive experimental results demonstrate that AMMNet outperforms +state-of-the-art SSC methods by a large margin, providing a promising direction +for improving the effectiveness and generalization of SSC methods.",cs.CV,['cs.CV'] +NeRF Director: Revisiting View Selection in Neural Volume Rendering,Wenhui Xiao · Rodrigo Santa Cruz · David Ahmedt-Aristizabal · Olivier Salvado · Clinton Fookes · Leo Lebrat,https://wenwhx.github.io/nerfdirector/,https://arxiv.org/abs/2310.20685,,2310.20685.pdf,NeRF Revisited: Fixing Quadrature Instability in Volume Rendering,"Neural radiance fields (NeRF) rely on volume rendering to synthesize novel +views. Volume rendering requires evaluating an integral along each ray, which +is numerically approximated with a finite sum that corresponds to the exact +integral along the ray under piecewise constant volume density. As a +consequence, the rendered result is unstable w.r.t. the choice of samples along +the ray, a phenomenon that we dub quadrature instability. We propose a +mathematically principled solution by reformulating the sample-based rendering +equation so that it corresponds to the exact integral under piecewise linear +volume density. This simultaneously resolves multiple issues: conflicts between +samples along different rays, imprecise hierarchical sampling, and +non-differentiability of quantiles of ray termination distances w.r.t. model +parameters. We demonstrate several benefits over the classical sample-based +rendering equation, such as sharper textures, better geometric reconstruction, +and stronger depth supervision. Our proposed formulation can be also be used as +a drop-in replacement to the volume rendering equation of existing NeRF-based +methods. 
Our project page can be found at pl-nerf.github.io.",cs.CV,['cs.CV'] +Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios,Shiyan Chen · Jiyuan Zhang · Zhaofei Yu · Tiejun Huang, ,https://ar5iv.labs.arxiv.org/html/2303.16783,,2303.16783.pdf,Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios,"Self-supervised denoising has attracted widespread attention due to its +ability to train without clean images. However, noise in real-world scenarios +is often spatially correlated, which causes many self-supervised algorithms +that assume pixel-wise independent noise to perform poorly. Recent works have +attempted to break noise correlation with downsampling or neighborhood masking. +However, denoising on downsampled subgraphs can lead to aliasing effects and +loss of details due to a lower sampling rate. Furthermore, the neighborhood +masking methods either come with high computational complexity or do not +consider local spatial preservation during inference. Through the analysis of +existing methods, we point out that the key to obtaining high-quality and +texture-rich results in real-world self-supervised denoising tasks is to train +at the original input resolution structure and use asymmetric operations during +training and inference. Based on this, we propose Asymmetric Tunable Blind-Spot +Network (AT-BSN), where the blind-spot size can be freely adjusted, thus better +balancing noise correlation suppression and image local spatial destruction +during training and inference. In addition, we regard the pre-trained AT-BSN as +a meta-teacher network capable of generating various teacher networks by +sampling different blind-spots. We propose a blind-spot based multi-teacher +distillation strategy to distill a lightweight network, significantly improving +performance. Experimental results on multiple datasets prove that our method +achieves state-of-the-art, and is superior to other self-supervised algorithms +in terms of computational overhead and visual effects.",cs.CV,['cs.CV'] +Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding,Alessandro Achille · Greg Ver Steeg · Tian Yu Liu · Matthew Trager · Carson Klingenberg · Stefano Soatto, ,https://arxiv.org/abs/2402.08919v1,,2402.08919v1.pdf,Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding,"Quantifying the degree of similarity between images is a key copyright issue +for image-based machine learning. In legal doctrine however, determining the +degree of similarity between works requires subjective analysis, and +fact-finders (judges and juries) can demonstrate considerable variability in +these subjective judgement calls. Images that are structurally similar can be +deemed dissimilar, whereas images of completely different scenes can be deemed +similar enough to support a claim of copying. We seek to define and compute a +notion of ""conceptual similarity"" among images that captures high-level +relations even among images that do not share repeated elements or visually +similar components. The idea is to use a base multi-modal model to generate +""explanations"" (captions) of visual data at increasing levels of complexity. 
+Then, similarity can be measured by the length of the caption needed to +discriminate between the two images: Two highly dissimilar images can be +discriminated early in their description, whereas conceptually dissimilar ones +will need more detail to be distinguished. We operationalize this definition +and show that it correlates with subjective (averaged human evaluation) +assessment, and beats existing baselines on both image-to-image and +text-to-text similarity benchmarks. Beyond just providing a number, our method +also offers interpretability by pointing to the specific level of granularity +of the description where the source data are differentiated.",cs.CV,"['cs.CV', 'cs.LG']" +Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models,Samar Fares · Karthik Nandakumar, ,https://arxiv.org/abs/2312.06230,,2312.06230.pdf,Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks,"This work studies the task of poisoned sample detection for defending against +data poisoning based backdoor attacks. Its core challenge is finding a +generalizable and discriminative metric to distinguish between clean and +various types of poisoned samples (e.g., various triggers, various poisoning +ratios). Inspired by a common phenomenon in backdoor attacks that the +backdoored model tend to map significantly different poisoned and clean samples +within the target class to similar activation areas, we introduce a novel +perspective of the circular distribution of the gradients w.r.t. sample +activation, dubbed gradient circular distribution (GCD). And, we find two +interesting observations based on GCD. One is that the GCD of samples in the +target class is much more dispersed than that in the clean class. The other is +that in the GCD of target class, poisoned and clean samples are clearly +separated. Inspired by above two observations, we develop an innovative +three-stage poisoned sample detection approach, called Activation Gradient +based Poisoned sample Detection (AGPD). First, we calculate GCDs of all classes +from the model trained on the untrustworthy dataset. Then, we identify the +target class(es) based on the difference on GCD dispersion between target and +clean classes. Last, we filter out poisoned samples within the identified +target class(es) based on the clear separation between poisoned and clean +samples. Extensive experiments under various settings of backdoor attacks +demonstrate the superior detection performance of the proposed method to +existing poisoned detection approaches according to sample activation-based +metrics.",cs.CR,['cs.CR'] +YOLO-World: Real-Time Open-Vocabulary Object Detection,Tianheng Cheng · Lin Song · Yixiao Ge · Wenyu Liu · Xinggang Wang · Ying Shan,https://github.com/AILab-CVC/YOLO-World,https://arxiv.org/abs/2401.17270,,2401.17270.pdf,YOLO-World: Real-Time Open-Vocabulary Object Detection,"The You Only Look Once (YOLO) series of detectors have established themselves +as efficient and practical tools. However, their reliance on predefined and +trained object categories limits their applicability in open scenarios. +Addressing this limitation, we introduce YOLO-World, an innovative approach +that enhances YOLO with open-vocabulary detection capabilities through +vision-language modeling and pre-training on large-scale datasets. 
+Specifically, we propose a new Re-parameterizable Vision-Language Path +Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate +the interaction between visual and linguistic information. Our method excels in +detecting a wide range of objects in a zero-shot manner with high efficiency. +On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on +V100, which outperforms many state-of-the-art methods in terms of both accuracy +and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable +performance on several downstream tasks, including object detection and +open-vocabulary instance segmentation.",cs.CV,['cs.CV'] +Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,Devikalyan Das · Christopher Wewer · Raza Yunus · Eddy Ilg · Jan Lenssen,https://geometric-rl.mpi-inf.mpg.de/npg/,https://arxiv.org/abs/2312.01196,,2312.01196.pdf,Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,"Reconstructing dynamic objects from monocular videos is a severely +underconstrained and challenging problem, and recent work has approached it in +various directions. However, owing to the ill-posed nature of this problem, +there has been no solution that can provide consistent, high-quality novel +views from camera positions that are significantly different from the training +views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on +this challenge by imposing a two-stage approach: first, we fit a low-rank +neural deformation model, which then is used as regularization for non-rigid +reconstruction in the second stage. The first stage learns the object's +deformations such that it preserves consistency in novel views. The second +stage obtains high reconstruction quality by optimizing 3D Gaussians that are +driven by the coarse model. To this end, we introduce a local 3D Gaussian +representation, where temporally shared Gaussians are anchored in and deformed +by local oriented volumes. The resulting combined model can be rendered as +radiance fields, resulting in high-quality photo-realistic reconstructions of +the non-rigidly deforming objects. We demonstrate that NPGs achieve superior +results compared to previous works, especially in challenging scenarios with +few multi-view cues.",cs.CV,['cs.CV'] +AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing,Fan Yang · Tianyi Chen · XIAOSHENG HE · Zhongang Cai · Lei Yang · Si Wu · Guosheng Lin, ,https://arxiv.org/abs/2312.02209,,2312.02209.pdf,AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing,"Editable 3D-aware generation, which supports user-interacted editing, has +witnessed rapid development recently. However, existing editable 3D GANs either +fail to achieve high-accuracy local editing or suffer from huge computational +costs. We propose AttriHuman-3D, an editable 3D human generation model, which +address the aforementioned problems with attribute decomposition and indexing. +The core idea of the proposed model is to generate all attributes (e.g. human +body, hair, clothes and so on) in an overall attribute space with six feature +planes, which are then decomposed and manipulated with different attribute +indexes. To precisely extract features of different attributes from the +generated feature planes, we propose a novel attribute indexing method as well +as an orthogonal projection regularization to enhance the disentanglement. 
We +also introduce a hyper-latent training strategy and an attribute-specific +sampling strategy to avoid style entanglement and misleading punishment from +the discriminator. Our method allows users to interactively edit selected +attributes in the generated 3D human avatars while keeping others fixed. Both +qualitative and quantitative experiments demonstrate that our model provides a +strong disentanglement between different attributes, allows fine-grained image +editing and generates high-quality 3D human avatars.",cs.CV,['cs.CV'] +GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting,Yiwen Chen · Zilong Chen · Chi Zhang · Feng Wang · Xiaofeng Yang · Yikai Wang · Zhongang Cai · Lei Yang · Huaping Liu · Guosheng Lin, ,https://arxiv.org/abs/2311.14521,,2311.14521.pdf,GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting,"3D editing plays a crucial role in many areas such as gaming and virtual +reality. Traditional 3D editing methods, which rely on representations like +meshes and point clouds, often fall short in realistically depicting complex +scenes. On the other hand, methods based on implicit 3D representations, like +Neural Radiance Field (NeRF), render complex scenes effectively but suffer from +slow processing speeds and limited control over specific scene areas. In +response to these challenges, our paper presents GaussianEditor, an innovative +and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D +representation. GaussianEditor enhances precision and control in editing +through our proposed Gaussian semantic tracing, which traces the editing target +throughout the training process. Additionally, we propose Hierarchical Gaussian +splatting (HGS) to achieve stabilized and fine results under stochastic +generative guidance from 2D diffusion models. We also develop editing +strategies for efficient object removal and integration, a challenging task for +existing methods. Our comprehensive experiments demonstrate GaussianEditor's +superior control, efficacy, and rapid performance, marking a significant +advancement in 3D editing. Project Page: +https://buaacyw.github.io/gaussian-editor/",cs.CV,['cs.CV'] +AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation,Qingping SUN · Yanjun Wang · Ailing Zeng · Wanqi Yin · Chen Wei · Wenjia Wang · Haiy Mei · Chi LEUNG · Ziwei Liu · Lei Yang · Zhongang Cai, ,https://arxiv.org/abs/2403.17934,,2403.17934.pdf,AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation,"Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh +recovery) involves the human body, hand, and expression estimation. Most +existing methods have tackled this task in a two-stage manner, first detecting +the human body part with an off-the-shelf detection model and inferring the +different human body parts individually. Despite the impressive results +achieved, these methods suffer from 1) loss of valuable contextual information +via cropping, 2) introducing distractions, and 3) lacking inter-association +among different persons and body parts, inevitably causing performance +degradation, especially for crowded scenes. To address these issues, we +introduce a novel all-in-one-stage framework, AiOS, for multiple expressive +human pose and shape recovery without an additional human detection step. +Specifically, our method is built upon DETR, which treats multi-person +whole-body mesh recovery task as a progressive set prediction problem with +various sequential detection. 
We devise the decoder tokens and extend them to +our task. Specifically, we first employ a human token to probe a human location +in the image and encode global features for each instance, which provides a +coarse location for the later transformer block. Then, we introduce a +joint-related token to probe the human joint in the image and encoder a +fine-grained local feature, which collaborates with the global feature to +regress the whole-body mesh. This straightforward but effective model +outperforms previous state-of-the-art methods by a 9% reduction in NMVE on +AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a +3% reduction in PVE on EgoBody.",cs.CV,['cs.CV'] +Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior,Wonseok Roh · Hwanhee Jung · Giljoo Nam · Jinseop Yeom · Hyunje Park · Sang Ho Yoon · Sangpil Kim, ,https://arxiv.org/abs/2311.12291,,2311.12291.pdf,Instance-aware 3D Semantic Segmentation powered by Shape Generators and Classifiers,"Existing 3D semantic segmentation methods rely on point-wise or voxel-wise +feature descriptors to output segmentation predictions. However, these +descriptors are often supervised at point or voxel level, leading to +segmentation models that can behave poorly at instance-level. In this paper, we +proposed a novel instance-aware approach for 3D semantic segmentation. Our +method combines several geometry processing tasks supervised at instance-level +to promote the consistency of the learned feature representation. Specifically, +our methods use shape generators and shape classifiers to perform shape +reconstruction and classification tasks for each shape instance. This enforces +the feature representation to faithfully encode both structural and local shape +information, with an awareness of shape instances. In the experiments, our +method significantly outperform existing approaches in 3D semantic segmentation +on several public benchmarks, such as Waymo Open Dataset, SemanticKITTI and +ScanNetV2.",cs.CV,['cs.CV'] +ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation,Suraj Patni · Aradhye Agarwal · Chetan Arora,https://ecodepth-iitd.github.io/,https://arxiv.org/abs/2403.18807,,2403.18807.pdf,ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation,"In the absence of parallax cues, a learning-based single image depth +estimation (SIDE) model relies heavily on shading and contextual cues in the +image. While this simplicity is attractive, it is necessary to train such +models on large and varied datasets, which are difficult to capture. It has +been shown that using embeddings from pre-trained foundational models, such as +CLIP, improves zero shot transfer in several applications. Taking inspiration +from this, in our paper we explore the use of global image priors generated +from a pre-trained ViT model to provide more detailed contextual information. +We argue that the embedding vector from a ViT model, pre-trained on a large +dataset, captures greater relevant information for SIDE than the usual route of +generating pseudo image captions, followed by CLIP based text embeddings. Based +on this idea, we propose a new SIDE model using a diffusion backbone which is +conditioned on ViT embeddings. Our proposed design establishes a new +state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of +0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). 
And on +KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to +0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model +trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) +over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, +18%, 45%, 9%) by ZoeDepth. The project page is available at +https://ecodepth-iitd.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs,Lin Song · Yukang Chen · Shuai Yang · Xiaohan Ding · Yixiao Ge · Ying-Cong Chen · Ying Shan, ,https://arxiv.org/abs/2405.18572,,2405.18572.pdf,Low-rank finetuning for LLMs: A fairness perspective,"Low-rank approximation techniques have become the de facto standard for +fine-tuning Large Language Models (LLMs) due to their reduced computational and +memory requirements. This paper investigates the effectiveness of these methods +in capturing the shift of fine-tuning datasets from the initial pre-trained +data distribution. Our findings reveal that there are cases in which low-rank +fine-tuning falls short in learning such shifts. This, in turn, produces +non-negligible side effects, especially when fine-tuning is adopted for +toxicity mitigation in pre-trained models, or in scenarios where it is +important to provide fair models. Through comprehensive empirical evidence on +several models, datasets, and tasks, we show that low-rank fine-tuning +inadvertently preserves undesirable biases and toxic behaviors. We also show +that this extends to sequential decision-making tasks, emphasizing the need for +careful evaluation to promote responsible LLMs development.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL']" +CG-HOI: Contact-Guided 3D Human-Object Interaction Generation,Christian Diller · Angela Dai,https://cg-hoi.christian-diller.de/#main,https://arxiv.org/abs/2311.16097v2,,2311.16097v2.pdf,CG-HOI: Contact-Guided 3D Human-Object Interaction Generation,"We propose CG-HOI, the first method to address the task of generating dynamic +3D human-object interactions (HOIs) from text. We model the motion of both +human and object in an interdependent fashion, as semantically rich human +motion rarely happens in isolation without any interactions. Our key insight is +that explicitly modeling contact between the human body surface and object +geometry can be used as strong proxy guidance, both during training and +inference. Using this guidance to bridge human and object motion enables +generating more realistic and physically plausible interaction sequences, where +the human body and corresponding object move in a coherent manner. Our method +first learns to model human motion, object motion, and contact in a joint +diffusion process, inter-correlated through cross-attention. We then leverage +this learned contact for guidance during inference to synthesize realistic and +coherent HOIs. Extensive evaluation shows that our joint contact-based +human-object interaction approach generates realistic and physically plausible +sequences, and we show two applications highlighting the capabilities of our +method. Conditioned on a given object trajectory, we can generate the +corresponding human motion without re-training, demonstrating strong +human-object interdependency learning. 
Our approach is also flexible, and can +be applied to static real-world 3D scene scans.",cs.CV,"['cs.CV', 'I.2.10; I.4.8; I.5.1; I.5.4']" +Digital Life Project: Autonomous 3D Characters with Social Intelligence,Zhongang Cai · Jianping Jiang · Zhongfei Qing · Xinying Guo · Mingyuan Zhang · Zhengyu Lin · Haiy Mei · Chen Wei · Wang Ruisi · Wanqi Yin · Liang Pan · Xiangyu Fan · Han Du · Peng Gao · Zhitao Yang · Yang Gao · Jiaqi Li · Tianxiang Ren · YuKun Wei · Xiaogang Wang · Chen Change Loy · Lei Yang · Ziwei Liu,https://digital-life-project.com/,https://arxiv.org/abs/2312.04547,,2312.04547.pdf,Digital Life Project: Autonomous 3D Characters with Social Intelligence,"In this work, we present Digital Life Project, a framework utilizing language +as the universal medium to build autonomous 3D characters, who are capable of +engaging in social interactions and expressing with articulated body motions, +thereby simulating life in a digital environment. Our framework comprises two +primary components: 1) SocioMind: a meticulously crafted digital brain that +models personalities with systematic few-shot exemplars, incorporates a +reflection process based on psychology principles, and emulates autonomy by +initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis +paradigm for controlling the character's digital body. It integrates motion +matching, a proven industry technique to ensure motion quality, with +cutting-edge advancements in motion generation for diversity. Extensive +experiments demonstrate that each module achieves state-of-the-art performance +in its respective domain. Collectively, they enable virtual characters to +initiate and sustain dialogues autonomously, while evolving their +socio-psychological states. Concurrently, these characters can perform +contextually relevant bodily movements. Additionally, a motion captioning +module further allows the virtual character to recognize and appropriately +respond to human players' actions. Homepage: https://digital-life-project.com/",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.HC']" +From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding,Yonglu Li · Xiaoqian Wu · Xinpeng Liu · Zehao Wang · Yiming Dou · Yikun Ji · Junyi Zhang · Yixing Li · Xudong LU · Jingru Tan · Cewu Lu, ,,https://synthical.com/article/a412be8a-adaa-450f-81ea-957ce0f2d0e4,,,,,nan +FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations,Christian Diller · Thomas Funkhouser · Angela Dai,https://future-human-3d.christian-diller.de/#main,https://arxiv.org/abs/2312.11972,,,Expressive Forecasting of 3D Whole-body Human Motions,"Human motion forecasting, with the goal of estimating future human behavior +over a period of time, is a fundamental task in many real-world applications. +However, existing works typically concentrate on predicting the major joints of +the human body without considering the delicate movements of the human hands. +In practical applications, hand gesture plays an important role in human +communication with the real world, and expresses the primary intention of human +beings. In this work, we are the first to formulate a whole-body human pose +forecasting task, which jointly predicts the future body and hand activities. 
+Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) +framework that aims to predict both coarse (body joints) and fine-grained +(gestures) activities collaboratively, enabling expressive and +cross-facilitated forecasting of 3D whole-body human motions. Specifically, our +model involves two key constituents: cross-context alignment (XCA) and +cross-context interaction (XCI). Considering the heterogeneous information +within the whole-body, XCA aims to align the latent features of various human +components, while XCI focuses on effectively capturing the context interaction +among the human components. We conduct extensive experiments on a +newly-introduced large-scale benchmark and achieve state-of-the-art +performance. The code is public for research purposes at +https://github.com/Dingpx/EAI.",cs.CV,['cs.CV'] +"UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition",Xiaohan Ding · Yiyuan Zhang · Yixiao Ge · Sijie Zhao · Lin Song · Xiangyu Yue · Ying Shan, ,https://arxiv.org/abs/2311.15599,,2311.15599.pdf,"UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition","Large-kernel convolutional neural networks (ConvNets) have recently received +extensive research attention, but two unresolved and critical issues demand +further investigation. 1) The architectures of existing large-kernel ConvNets +largely follow the design principles of conventional ConvNets or transformers, +while the architectural design for large-kernel ConvNets remains +under-addressed. 2) As transformers have dominated multiple modalities, it +remains to be investigated whether ConvNets also have a strong universal +perception ability in domains beyond vision. In this paper, we contribute from +two aspects. 1) We propose four architectural guidelines for designing +large-kernel ConvNets, the core of which is to exploit the essential +characteristics of large kernels that distinguish them from small kernels - +they can see wide without going deep. Following such guidelines, our proposed +large-kernel ConvNet shows leading performance in image recognition (ImageNet +accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), +demonstrating better performance and higher speed than the recent powerful +competitors. 2) We discover large kernels are the key to unlocking the +exceptional performance of ConvNets in domains where they were originally not +proficient. With certain modality-related preprocessing approaches, the +proposed model achieves state-of-the-art performance on time-series forecasting +and audio recognition tasks even without modality-specific customization to the +architecture. All the code and models are publicly available on GitHub and +Huggingface.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning,Yuxiang Zhang · Hongwen Zhang · Liangxiao Hu · Jiajun Zhang · Hongwei Yi · Shengping Zhang · Yebin Liu,https://zhangyux15.github.io/ProxyCapV2,https://arxiv.org/abs/2307.01200,,2307.01200.pdf,ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning,"Learning-based approaches to monocular motion capture have recently shown +promising results by learning to regress in a data-driven manner. 
However, due +to the challenges in data collection and network designs, it remains +challenging for existing solutions to achieve real-time full-body capture while +being accurate in world space. In this work, we introduce ProxyCap, a +human-centric proxy-to-motion learning scheme to learn world-space motions from +a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy +data enables us to build a learning-based network with accurate world-space +supervision while also mitigating the generalization issues. For more accurate +and physically plausible predictions in world space, our network is designed to +learn human motions from a human-centric perspective, which enables the +understanding of the same motion captured with different camera trajectories. +Moreover, a contact-aware neural motion descent module is proposed in our +network so that it can be aware of foot-ground contact and motion misalignment +with the proxy observations. With the proposed learning-based solution, we +demonstrate the first real-time monocular full-body capture system with +plausible foot-ground contact in world space even using hand-held moving +cameras. Our project page is https://zhangyux15.github.io/ProxyCapV2.",cs.CV,['cs.CV'] +DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation,Chenyang Wang · Zerong Zheng · Tao Yu · Xiaoqian Lv · Bineng Zhong · Shengping Zhang · Liqiang Nie, ,https://arxiv.org/abs/2312.00853,,2312.00853.pdf,Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution,"Real-world low-resolution (LR) videos have diverse and complex degradations, +imposing great challenges on video super-resolution (VSR) algorithms to +reproduce their high-resolution (HR) counterparts with high quality. Recently, +the diffusion models have shown compelling performance in generating realistic +details for image restoration tasks. However, the diffusion process has +randomness, making it hard to control the contents of restored images. This +issue becomes more serious when applying diffusion models to VSR tasks because +temporal consistency is crucial to the perceptual quality of videos. In this +paper, we propose an effective real-world VSR algorithm by leveraging the +strength of pre-trained latent diffusion models. To ensure the content +consistency among adjacent frames, we exploit the temporal dynamics in LR +videos to guide the diffusion process by optimizing the latent sampling path +with a motion-guided loss, ensuring that the generated HR video maintains a +coherent and continuous visual flow. To further mitigate the discontinuity of +generated details, we insert temporal module to the decoder and fine-tune it +with an innovative sequence-oriented loss. 
The proposed motion-guided latent +diffusion (MGLD) based VSR algorithm achieves significantly better perceptual +quality than state-of-the-arts on real-world VSR benchmark datasets, validating +the effectiveness of the proposed model design and training strategies.",cs.CV,['cs.CV'] +DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis,Yuming Gu · Hongyi Xu · You Xie · Guoxian Song · Yichun Shi · Di Chang · Jing Yang · Linjie Luo,https://freedomgu.github.io/DiffPortrait3D/,https://arxiv.org/abs/2312.13016,,2312.13016.pdf,DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis,"We present DiffPortrait3D, a conditional diffusion model that is capable of +synthesizing 3D-consistent photo-realistic novel views from as few as a single +in-the-wild portrait. Specifically, given a single RGB input, we aim to +synthesize plausible but consistent facial details rendered from novel camera +views with retained both identity and facial expression. In lieu of +time-consuming optimization and fine-tuning, our zero-shot method generalizes +well to arbitrary face portraits with unposed camera views, extreme facial +expressions, and diverse artistic depictions. At its core, we leverage the +generative prior of 2D diffusion models pre-trained on large-scale image +datasets as our rendering backbone, while the denoising is guided with +disentangled attentive control of appearance and camera pose. To achieve this, +we first inject the appearance context from the reference image into the +self-attention layers of the frozen UNets. The rendering view is then +manipulated with a novel conditional control module that interprets the camera +pose by watching a condition image of a crossed subject from the same view. +Furthermore, we insert a trainable cross-view attention module to enhance view +consistency, which is further strengthened with a novel 3D-aware noise +generation process during inference. We demonstrate state-of-the-art results +both qualitatively and quantitatively on our challenging in-the-wild and +multi-view benchmarks.",cs.CV,['cs.CV'] +Tyche: Stochastic in Context Learning for Medical Image Segmentation,Marianne Rakic · Hallee Wong · Jose Javier Gonzalez Ortiz · Beth Cimini · John Guttag · Adrian V. Dalca, ,https://arxiv.org/abs/2401.13650,,2401.13650.pdf,Tyche: Stochastic In-Context Learning for Medical Image Segmentation,"Existing learning-based solutions to medical image segmentation have two +important shortcomings. First, for most new segmentation task, a new model has +to be trained or fine-tuned. This requires extensive resources and machine +learning expertise, and is therefore often infeasible for medical researchers +and clinicians. Second, most existing segmentation methods produce a single +deterministic segmentation mask for a given image. In practice however, there +is often considerable uncertainty about what constitutes the correct +segmentation, and different expert annotators will often segment the same image +differently. We tackle both of these problems with Tyche, a model that uses a +context set to generate stochastic predictions for previously unseen tasks +without the need to retrain. Tyche differs from other in-context segmentation +methods in two important ways. (1) We introduce a novel convolution block +architecture that enables interactions among predictions. (2) We introduce +in-context test-time augmentation, a new mechanism to provide prediction +stochasticity. 
When combined with appropriate model design and loss functions, +Tyche can predict a set of plausible diverse segmentation candidates for new or +unseen medical images and segmentation tasks without the need to retrain.",eess.IV,"['eess.IV', 'cs.CV']" +Incremental Residual Concept Bottleneck Models,Chenming Shang · Shiji Zhou · Hengyuan Zhang · Xinzhe Ni · Yujiu Yang · Yuwang Wang, ,https://arxiv.org/abs/2404.08978,,2404.08978.pdf,Incremental Residual Concept Bottleneck Models,"Concept Bottleneck Models (CBMs) map the black-box visual representations +extracted by deep neural networks onto a set of interpretable concepts and use +the concepts to make predictions, enhancing the transparency of the +decision-making process. Multimodal pre-trained models can match visual +representations with textual concept embeddings, allowing for obtaining the +interpretable concept bottleneck without the expertise concept annotations. +Recent research has focused on the concept bank establishment and the +high-quality concept selection. However, it is challenging to construct a +comprehensive concept bank through humans or large language models, which +severely limits the performance of CBMs. In this work, we propose the +Incremental Residual Concept Bottleneck Model (Res-CBM) to address the +challenge of concept completeness. Specifically, the residual concept +bottleneck model employs a set of optimizable vectors to complete missing +concepts, then the incremental concept discovery module converts the +complemented vectors with unclear meanings into potential concepts in the +candidate concept bank. Our approach can be applied to any user-defined concept +bank, as a post-hoc processing method to enhance the performance of any CBMs. +Furthermore, to measure the descriptive efficiency of CBMs, the Concept +Utilization Efficiency (CUE) metric is proposed. Experiments show that the +Res-CBM outperforms the current state-of-the-art methods in terms of both +accuracy and efficiency and achieves comparable performance to black-box models +across multiple datasets.",cs.LG,"['cs.LG', 'cs.AI']" +RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features,Geonho Bang · Kwangjin Choi · Jisong Kim · Dongsuk Kum · Jun Won Choi, ,https://arxiv.org/abs/2403.05061,,2403.05061.pdf,RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features,"The inherent noisy and sparse characteristics of radar data pose challenges +in finding effective representations for 3D object detection. In this paper, we +propose RadarDistill, a novel knowledge distillation (KD) method, which can +improve the representation of radar data by leveraging LiDAR data. RadarDistill +successfully transfers desirable characteristics of LiDAR features into radar +features using three key components: Cross-Modality Alignment (CMA), +Activation-based Feature Distillation (AFD), and Proposal-based Feature +Distillation (PFD). CMA enhances the density of radar features by employing +multiple layers of dilation operations, effectively addressing the challenge of +inefficient knowledge transfer from LiDAR to radar. AFD selectively transfers +knowledge based on regions of the LiDAR features, with a specific focus on +areas where activation intensity exceeds a predefined threshold. PFD similarly +guides the radar network to selectively mimic features from the LiDAR network +within the object proposals. 
Our comparative analyses conducted on the nuScenes +datasets demonstrate that RadarDistill achieves state-of-the-art (SOTA) +performance for radar-only object detection task, recording 20.5% in mAP and +43.7% in NDS. Also, RadarDistill significantly improves the performance of the +camera-radar fusion model.",cs.CV,['cs.CV'] +Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes,Gaurav Shrivastava · Abhinav Shrivastava,https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html,https://arxiv.org/abs/2401.14718,,2401.14718.pdf,A Survey on Video Prediction: From Deterministic to Generative Approaches,"Video prediction, a fundamental task in computer vision, aims to enable +models to generate sequences of future frames based on existing video content. +This task has garnered widespread application across various domains. In this +paper, we comprehensively survey both historical and contemporary works in this +field, encompassing the most widely used datasets and algorithms. Our survey +scrutinizes the challenges and evolving landscape of video prediction within +the realm of computer vision. We propose a novel taxonomy centered on the +stochastic nature of video prediction algorithms. This taxonomy accentuates the +gradual transition from deterministic to generative prediction methodologies, +underlining significant advancements and shifts in approach.",cs.CV,['cs.CV'] +Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring,Xin Gao · Tianheng Qiu · Xinyu Zhang · Hanlin Bai · Kang Liu · xuan huang · Hu Wei · Guoying Zhang · Huaping Liu, ,https://arxiv.org/abs/2401.00027,,2401.00027.pdf,Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring,"Coarse-to-fine schemes are widely used in traditional single-image motion +deblur; however, in the context of deep learning, existing multi-scale +algorithms not only require the use of complex modules for feature fusion of +low-scale RGB images and deep semantics, but also manually generate +low-resolution pairs of images that do not have sufficient confidence. In this +work, we propose a multi-scale network based on single-input and +multiple-outputs(SIMO) for motion deblurring. This simplifies the complexity of +algorithms based on a coarse-to-fine scheme. To alleviate restoration defects +impacting detail information brought about by using a multi-scale architecture, +we combine the characteristics of real-world blurring trajectories with a +learnable wavelet transform module to focus on the directional continuity and +frequency features of the step-by-step transitions between blurred images to +sharp images. In conclusion, we propose a multi-scale network with a learnable +discrete wavelet transform (MLWNet), which exhibits state-of-the-art +performance on multiple real-world deblurred datasets, in terms of both +subjective and objective quality as well as computational efficiency.",cs.CV,['cs.CV'] +Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,Keon Hee Park · Kyungwoo Song · Gyeong-Moon Park, ,https://arxiv.org/abs/2404.02117,,,Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,"Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model +to learn new classes incrementally without forgetting when only a few samples +for each class are given. 
FSCIL encounters two significant challenges: +catastrophic forgetting and overfitting, and these challenges have driven prior +studies to primarily rely on shallow models, such as ResNet-18. Even though +their limited capacity can mitigate both forgetting and overfitting issues, it +leads to inadequate knowledge transfer during few-shot incremental sessions. In +this paper, we argue that large models such as vision and language transformers +pre-trained on large datasets can be excellent few-shot incremental learners. +To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained +Vision and Language transformers with prompting functions and knowledge +distillation. Our framework effectively addresses the challenges of +catastrophic forgetting and overfitting in large models through new pre-trained +knowledge tuning (PKT) and two losses: entropy-based divergence loss and +semantic knowledge distillation loss. Experimental results show that the +proposed PriViLege significantly outperforms the existing state-of-the-art +methods with a large margin, e.g., +9.38% in CUB200, +20.58% in CIFAR-100, and ++13.36% in miniImageNet. Our implementation code is available at +https://github.com/KHU-AGI/PriViLege.",cs.CV,['cs.CV'] +PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation,Jinfeng Xu · Siyuan Yang · Xianzhi Li · Yuan Tang · yixue Hao · Long Hu · Min Chen, ,https://arxiv.org/abs/2404.00979,,2404.00979.pdf,PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation,"Existing point cloud semantic segmentation networks cannot identify unknown +classes and update their knowledge, due to a closed-set and static perspective +of the real world, which would induce the intelligent agent to make bad +decisions. To address this problem, we propose a Probability-Driven Framework +(PDF) for open world semantic segmentation that includes (i) a lightweight +U-decoder branch to identify unknown classes by estimating the uncertainties, +(ii) a flexible pseudo-labeling scheme to supply geometry features along with +probability distribution features of unknown classes by generating pseudo +labels, and (iii) an incremental knowledge distillation strategy to incorporate +novel classes into the existing knowledge base gradually. Our framework enables +the model to behave like human beings, which could recognize unknown objects +and incrementally learn them with the corresponding knowledge. Experimental +results on the S3DIS and ScanNetv2 datasets demonstrate that the proposed PDF +outperforms other methods by a large margin in both important tasks of open +world semantic segmentation.",cs.CV,['cs.CV'] +Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle,Youtian Lin · Zuozhuo Dai · Siyu Zhu · Yao Yao, ,https://arxiv.org/abs/2312.03431,,2312.03431.pdf,Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle,"We introduce Gaussian-Flow, a novel point-based approach for fast dynamic +scene reconstruction and real-time rendering from both multi-view and monocular +videos. In contrast to the prevalent NeRF-based approaches hampered by slow +training and rendering speeds, our approach harnesses recent advancements in +point-based 3D Gaussian Splatting (3DGS). 
Specifically, a novel Dual-Domain +Deformation Model (DDDM) is proposed to explicitly model attribute deformations +of each Gaussian point, where the time-dependent residual of each attribute is +captured by a polynomial fitting in the time domain, and a Fourier series +fitting in the frequency domain. The proposed DDDM is capable of modeling +complex scene deformations across long video footage, eliminating the need for +training separate 3DGS for each frame or introducing an additional implicit +neural field to model 3D dynamics. Moreover, the explicit deformation modeling +for discretized Gaussian points ensures ultra-fast training and rendering of a +4D scene, which is comparable to the original 3DGS designed for static 3D +reconstruction. Our proposed approach showcases a substantial efficiency +improvement, achieving a $5\times$ faster training speed compared to the +per-frame 3DGS modeling. In addition, quantitative results demonstrate that the +proposed Gaussian-Flow significantly outperforms previous leading methods in +novel view rendering quality. Project page: +https://nju-3dv.github.io/projects/Gaussian-Flow",cs.CV,['cs.CV'] +Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation,Zhipeng Du · Miaojing Shi · Jiankang Deng,https://github.com/ZPDu/Boosting-Object-Detection-with-Zero-Shot-Day-Night-Domain-Adaptation,https://arxiv.org/abs/2312.01220,,2312.01220.pdf,Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation,"Detecting objects in low-light scenarios presents a persistent challenge, as +detectors trained on well-lit data exhibit significant performance degradation +on low-light data due to low visibility. Previous methods mitigate this issue +by exploring image enhancement or object detection techniques with real +low-light image datasets. However, the progress is impeded by the inherent +difficulties about collecting and annotating low-light images. To address this +challenge, we propose to boost low-light object detection with zero-shot +day-night domain adaptation, which aims to generalize a detector from well-lit +scenarios to low-light ones without requiring real low-light data. Revisiting +Retinex theory in the low-level vision, we first design a reflectance +representation learning module to learn Retinex-based illumination invariance +in images with a carefully designed illumination invariance reinforcement +strategy. Next, an interchange-redecomposition-coherence procedure is +introduced to improve over the vanilla Retinex image decomposition process by +performing two sequential image decompositions and introducing a +redecomposition cohering loss. Extensive experiments on ExDark, DARK FACE, and +CODaN datasets show strong low-light generalizability of our method. Our code +is available at https://github.com/ZPDu/DAI-Net.",cs.CV,['cs.CV'] +Clockwork Diffusion: Efficient Generation With Model-Step Distillation,Amirhossein Habibian · Amir Ghodrati · Noor Fathima · Guillaume Sautiere · Risheek Garrepalli · Fatih Porikli · Jens Petersen, ,https://arxiv.org/abs/2312.08128,,2312.08128.pdf,Clockwork Diffusion: Efficient Generation With Model-Step Distillation,"This work aims to improve the efficiency of text-to-image diffusion models. +While diffusion models use computationally expensive UNet-based denoising +operations in every generation step, we identify that not all operations are +equally relevant for the final output quality. 
In particular, we observe that +UNet layers operating on high-res feature maps are relatively sensitive to +small perturbations. In contrast, low-res feature maps influence the semantic +layout of the final image and can often be perturbed with no noticeable change +in the output. Based on this observation, we propose Clockwork Diffusion, a +method that periodically reuses computation from preceding denoising steps to +approximate low-res feature maps at one or more subsequent steps. For multiple +baselines, and for both text-to-image generation and image editing, we +demonstrate that Clockwork leads to comparable or improved perceptual scores +with drastically reduced computational complexity. As an example, for Stable +Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and +CLIP change.",cs.CV,['cs.CV'] +BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection,Wenjie Wang · Yehao Lu · Guangcong Zheng · Shuigenzhan · Xiaoqing Ye · Zichang Tan · Jingdong Wang · Gaoang Wang · Xi Li,https://github.com/DaTongjie/BEVSpread,https://arxiv.org/abs/2312.00633,,2312.00633.pdf,Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach,"3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a +prevalent approach in the field of autonomous driving. Despite the demonstrated +improvements in accuracy and velocity estimation compared to perspective view +methods, the deployment of BEV-based techniques in real-world autonomous +vehicles remains challenging. This is primarily due to their reliance on +vision-transformer (ViT) based architectures, which introduce quadratic +complexity with respect to the input resolution. To address this issue, we +propose an efficient BEV-based 3D detection framework called BEVENet, which +leverages a convolutional-only architectural design to circumvent the +limitations of ViT models while maintaining the effectiveness of BEV-based +methods. Our experiments show that BEVENet is 3$\times$ faster than +contemporary state-of-the-art (SOTA) approaches on the NuScenes challenge, +achieving a mean average precision (mAP) of 0.456 and a nuScenes detection +score (NDS) of 0.555 on the NuScenes validation dataset, with an inference +speed of 47.6 frames per second. To the best of our knowledge, this study +stands as the first to achieve such significant efficiency improvements for +BEV-based methods, highlighting their enhanced feasibility for real-world +autonomous driving applications.",cs.CV,"['cs.CV', 'cs.AI']" +GARField: Group Anything with Radiance Fields,Chung Min Kim · Mingxuan Wu · Justin Kerr · Ken Goldberg · Matthew Tancik · Angjoo Kanazawa, ,https://arxiv.org/abs/2401.09419,,2401.09419.pdf,GARField: Group Anything with Radiance Fields,"Grouping is inherently ambiguous due to the multiple levels of granularity in +which one can decompose a scene -- should the wheels of an excavator be +considered separate or part of the whole? We present Group Anything with +Radiance Fields (GARField), an approach for decomposing 3D scenes into a +hierarchy of semantically meaningful groups from posed image inputs. To do this +we embrace group ambiguity through physical scale: by optimizing a +scale-conditioned 3D affinity feature field, a point in the world can belong to +different groups of different sizes. 
We optimize this field from a set of 2D +masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine +hierarchy, using scale to consistently fuse conflicting masks from different +viewpoints. From this field we can derive a hierarchy of possible groupings via +automatic tree construction or user interaction. We evaluate GARField on a +variety of in-the-wild scenes and find it effectively extracts groups at many +levels: clusters of objects, objects, and various subparts. GARField inherently +represents multi-view consistent groupings and produces higher fidelity groups +than the input SAM masks. GARField's hierarchical grouping could have exciting +downstream applications such as 3D asset extraction or dynamic scene +understanding. See the project website at https://www.garfield.studio/",cs.CV,"['cs.CV', 'cs.GR']" +General Point Model Pretraining with Autoencoding and Autoregressive,Zhe Li · Zhangyang Gao · Cheng Tan · Bocheng Ren · Laurence Yang · Stan Z. Li, ,https://arxiv.org/abs/2310.16861,,2310.16861.pdf,General Point Model with Autoencoding and Autoregressive,"The pre-training architectures of large language models encompass various +types, including autoencoding models, autoregressive models, and +encoder-decoder models. We posit that any modality can potentially benefit from +a large language model, as long as it undergoes vector quantization to become +discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which +seamlessly integrates autoencoding and autoregressive tasks in point cloud +transformer. This model is versatile, allowing fine-tuning for downstream point +cloud representation tasks, as well as unconditional and conditional generation +tasks. GPM enhances masked prediction in autoencoding through various forms of +mask padding tasks, leading to improved performance in point cloud +understanding. Additionally, GPM demonstrates highly competitive results in +unconditional point cloud generation tasks, even exhibiting the potential for +conditional generation tasks by modifying the input's conditional information. +Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves +superior performance in point cloud understanding tasks. Furthermore, the +integration of autoregressive and autoencoding within the same transformer +underscores its versatility across different downstream tasks.",cs.LG,"['cs.LG', 'cs.CV']" +NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation,Minh-Tuan Tran · Trung Le · Xuan-May Le · Mehrtash Harandi · Quan Tran · Dinh Phung,https://github.com/tmtuan1307/NAYER,https://arxiv.org/abs/2310.00258,,2310.00258.pdf,NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation,"Data-Free Knowledge Distillation (DFKD) has made significant recent strides +by transferring knowledge from a teacher neural network to a student neural +network without accessing the original data. Nonetheless, existing approaches +encounter a significant challenge when attempting to generate samples from +random noise inputs, which inherently lack meaningful information. +Consequently, these models struggle to effectively map this noise to the +ground-truth sample distribution, resulting in prolonging training times and +low-quality outputs. 
In this paper, we propose a novel Noisy Layer Generation +method (NAYER) which relocates the random source from the input to a noisy +layer and utilizes the meaningful constant label-text embedding (LTE) as the +input. LTE is generated by using the language model once, and then it is stored +in memory for all subsequent training processes. The significance of LTE lies +in its ability to contain substantial meaningful inter-class information, +enabling the generation of high-quality samples with only a few training steps. +Simultaneously, the noisy layer plays a key role in addressing the issue of +diversity in sample generation by preventing the model from overemphasizing the +constrained label information. By reinitializing the noisy layer in each +iteration, we aim to facilitate the generation of diverse samples while still +retaining the method's efficiency, thanks to the ease of learning provided by +LTE. Experiments carried out on multiple datasets demonstrate that our NAYER +not only outperforms the state-of-the-art methods but also achieves speeds 5 to +15 times faster than previous approaches. The code is available at +https://github.com/tmtuan1307/nayer.",cs.CV,['cs.CV'] +MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning,Zhe Li · Laurence Yang · Bocheng Ren · Xin Nie · Zhangyang Gao · Cheng Tan · Stan Z. Li, ,https://arxiv.org/abs/2402.02045,,2402.02045.pdf,MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning,"The scarcity of annotated data has sparked significant interest in +unsupervised pre-training methods that leverage medical reports as auxiliary +signals for medical visual representation learning. However, existing research +overlooks the multi-granularity nature of medical visual representation and +lacks suitable contrastive learning techniques to improve the models' +generalizability across different granularities, leading to the +underutilization of image-text information. To address this, we propose MLIP, a +novel framework leveraging domain-specific medical knowledge as guiding signals +to integrate language information into the visual domain through image-text +contrastive learning. Our model includes global contrastive learning with our +designed divergence encoder, local token-knowledge-patch alignment contrastive +learning, and knowledge-guided category-level contrastive learning with expert +knowledge. Experimental evaluations reveal the efficacy of our model in +enhancing transfer performance for tasks such as image classification, object +detection, and semantic segmentation. Notably, MLIP surpasses state-of-the-art +methods even with limited annotated data, highlighting the potential of +multimodal pre-training in advancing medical representation learning.",cs.CV,['cs.CV'] +Inversion-Free Image Editing with Language-Guided Diffusion Models,Sihan Xu · Yidong Huang · Jiayi Pan · Ziqiao Ma · Joyce Chai,https://sled-group.github.io/InfEdit/,https://arxiv.org/abs/2312.04965,,2312.04965.pdf,Inversion-Free Image Editing with Natural Language,"Despite recent advances in inversion-based editing, text-guided image +manipulation remains challenging for diffusion models. The primary bottlenecks +include 1) the time-consuming nature of the inversion process; 2) the struggle +to balance consistency with accuracy; 3) the lack of compatibility with +efficient consistency sampling methods used in consistency models. 
To address +the above issues, we start by asking ourselves if the inversion process can be +eliminated for editing. We show that when the initial sample is known, a +special variance schedule reduces the denoising step to the same form as the +multi-step consistency sampling. We name this Denoising Diffusion Consistent +Model (DDCM), and note that it implies a virtual inversion strategy without +explicit inversion in sampling. We further unify the attention control +mechanisms in a tuning-free framework for text-guided editing. Combining them, +we present inversion-free editing (InfEdit), which allows for consistent and +faithful editing for both rigid and non-rigid semantic changes, catering to +intricate modifications without compromising on the image's integrity and +explicit inversion. Through extensive experiments, InfEdit shows strong +performance in various editing tasks and also maintains a seamless workflow +(less than 3 seconds on one single A40), demonstrating the potential for +real-time applications. Project Page: https://sled-group.github.io/InfEdit/",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling,Miguel Fainstein · Viviana Siless · Emmanuel Iarussi,https://lia-ditella.github.io/DUDF/,https://arxiv.org/abs/2402.08876,,2402.08876.pdf,DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling,"In recent years, there has been a growing interest in training Neural +Networks to approximate Unsigned Distance Fields (UDFs) for representing open +surfaces in the context of 3D reconstruction. However, UDFs are +non-differentiable at the zero level set which leads to significant errors in +distances and gradients, generally resulting in fragmented and discontinuous +surfaces. In this paper, we propose to learn a hyperbolic scaling of the +unsigned distance field, which defines a new Eikonal problem with distinct +boundary conditions. This allows our formulation to integrate seamlessly with +state-of-the-art continuously differentiable implicit neural representation +networks, largely applied in the literature to represent signed distance +fields. Our approach not only addresses the challenge of open surface +representation but also demonstrates significant improvement in reconstruction +quality and training performance. Moreover, the unlocked field's +differentiability allows the accurate computation of essential topological +properties such as normal directions and curvatures, pervasive in downstream +tasks such as rendering. Through extensive experiments, we validate our +approach across various data sets and against competitive baselines. The +results demonstrate enhanced accuracy and up to an order of magnitude increase +in speed compared to previous methods.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'I.2.10; I.4.10; I.3.7']" +RoMa: Robust Dense Feature Matching,Johan Edstedt · Qiyu Sun · Georg Bökman · Mårten Wadenbäck · Michael Felsberg,https://parskatt.github.io/RoMa/,https://arxiv.org/html/2305.15404v2,,2305.15404v2.pdf,RoMa: Robust Dense Feature Matching,"Feature matching is an important computer vision task that involves +estimating correspondences between two images of a 3D scene, and dense methods +estimate all such correspondences. The aim is to learn a robust model, i.e., a +model able to match under challenging real-world changes. In this work, we +propose such a model, leveraging frozen pretrained features from the foundation +model DINOv2. 
Although these features are significantly more robust than local +features trained from scratch, they are inherently coarse. We therefore combine +them with specialized ConvNet fine features, creating a precisely localizable +feature pyramid. To further improve robustness, we propose a tailored +transformer match decoder that predicts anchor probabilities, which enables it +to express multimodality. Finally, we propose an improved loss formulation +through regression-by-classification with subsequent robust regression. We +conduct a comprehensive set of experiments that show that our method, RoMa, +achieves significant gains, setting a new state-of-the-art. In particular, we +achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is +provided at https://github.com/Parskatt/RoMa",cs.CV,['cs.CV'] +Harnessing Large Language Models for Training-free Video Anomaly Detection,Luca Zanella · Willi Menapace · Massimiliano Mancini · Yiming Wang · Elisa Ricci, ,,https://paperswithcode.com/paper/harnessing-large-language-models-for-training,,,,,nan +Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models,Nikita Starodubcev · Dmitry Baranchuk · Artem Fedorov · Artem Babenko, ,https://arxiv.org/abs/2312.10835,,2312.10835.pdf,Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models,"Knowledge distillation methods have recently shown to be a promising +direction to speedup the synthesis of large-scale diffusion models by requiring +only a few inference steps. While several powerful distillation methods were +recently proposed, the overall quality of student samples is typically lower +compared to the teacher ones, which hinders their practical usage. In this +work, we investigate the relative quality of samples produced by the teacher +text-to-image diffusion model and its distilled student version. As our main +empirical finding, we discover that a noticeable portion of student samples +exhibit superior fidelity compared to the teacher ones, despite the +""approximate"" nature of the student. Based on this finding, we propose an +adaptive collaboration between student and teacher diffusion models for +effective text-to-image synthesis. Specifically, the distilled model produces +the initial sample, and then an oracle decides whether it needs further +improvements with a slow teacher model. Extensive experiments demonstrate that +the designed pipeline surpasses state-of-the-art text-to-image alternatives for +various inference budgets in terms of human preference. Furthermore, the +proposed approach can be naturally used in popular applications such as +text-guided image editing and controllable generation.",cs.CV,['cs.CV'] +Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering,Zhaohe Liao · Jiangtong Li · Li Niu · Liqing Zhang, ,,https://dl.acm.org/doi/abs/10.1145/3581783.3613909,,,,,nan +$360+x$: A Panoptic Multi-modal Scene Understanding Dataset,Hao Chen · Yuqi Hou · Chenyuan Qu · Irene Testini · Xiaohan Hong · Jianbo Jiao,https://x360dataset.github.io/,https://arxiv.org/abs/2404.00989,,2404.00989.pdf,360+x: A Panoptic Multi-modal Scene Understanding Dataset,"Human perception of the world is shaped by a multitude of viewpoints and +modalities. While many existing datasets focus on scene understanding from a +certain perspective (e.g. 
egocentric or third-person views), our dataset offers +a panoptic perspective (i.e. multiple viewpoints with multiple data +modalities). Specifically, we encapsulate third-person panoramic and front +views, as well as egocentric monocular/binocular views with rich modalities +including video, multi-channel audio, directional binaural delay, location data +and textual scene descriptions within each scene captured, presenting +comprehensive observation of the world. Figure 1 offers a glimpse of all 28 +scene categories of our 360+x dataset. To the best of our knowledge, this is +the first database that covers multiple viewpoints with multiple data +modalities to mimic how daily information is accessed in the real world. +Through our benchmark analysis, we presented 5 different scene understanding +tasks on the proposed 360+x dataset to evaluate the impact and benefit of each +data modality and perspective in panoptic scene understanding. We hope this +unique dataset could broaden the scope of comprehensive scene understanding and +encourage the community to approach these problems from more diverse +perspectives.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'cs.SD', 'eess.AS']" +Text-Enhanced Data-free Approach for Federated Class-Incremental Learning,Minh-Tuan Tran · Trung Le · Xuan-May Le · Mehrtash Harandi · Dinh Phung,https://github.com/tmtuan1307/LANDER,https://arxiv.org/abs/2403.14101,,2403.14101.pdf,Text-Enhanced Data-free Approach for Federated Class-Incremental Learning,"Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal +issue, involving the dynamic addition of new classes in the context of +federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a +crucial role in addressing catastrophic forgetting and data privacy problems. +However, prior approaches lack the crucial synergy between DFKT and the model +training phases, causing DFKT to encounter difficulties in generating +high-quality data from a non-anchored latent space of the old task model. In +this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge +Transfer) to address this issue by utilizing label text embeddings (LTE) +produced by pretrained language models. Specifically, during the model training +phase, our approach treats LTE as anchor points and constrains the feature +embeddings of corresponding training samples around them, enriching the +surrounding area with more meaningful information. In the DFKT phase, by using +these LTE anchors, LANDER can synthesize more meaningful samples, thereby +effectively addressing the forgetting problem. Additionally, instead of tightly +constraining embeddings toward the anchor, the Bounding Loss is introduced to +encourage sample embeddings to remain flexible within a defined radius. This +approach preserves the natural differences in sample embeddings and mitigates +the embedding overlap caused by heterogeneous federated settings. Extensive +experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that +LANDER significantly outperforms previous methods and achieves state-of-the-art +performance in FCIL. 
The code is available at +https://github.com/tmtuan1307/lander.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Rethinking Boundary Discontinuity Problem for Oriented Object Detection,Hang Xu · Xinyuan Liu · Haonan Xu · Yike Ma · Zunjie Zhu · Chenggang Yan · Feng Dai,https://github.com/hangxu-cv/cvpr24acm,,https://ieeexplore.ieee.org/abstract/document/10475581,,,,,nan +GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation,Tong Wu · Guandao Yang · Zhibing Li · Kai Zhang · Ziwei Liu · Leonidas Guibas · Dahua Lin · Gordon Wetzstein, ,https://arxiv.org/abs/2401.04092,,2401.04092.pdf,GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation,"Despite recent advances in text-to-3D generative methods, there is a notable +absence of reliable evaluation metrics. Existing metrics usually focus on a +single criterion each, such as how well the asset aligned with the input text. +These metrics lack the flexibility to generalize to different evaluation +criteria and might not align well with human preferences. Conducting user +preference studies is an alternative that offers both adaptability and +human-aligned results. User studies, however, can be very expensive to scale. +This paper presents an automatic, versatile, and human-aligned evaluation +metric for text-to-3D generative models. To this end, we first develop a prompt +generator using GPT-4V to generate evaluating prompts, which serve as input to +compare text-to-3D models. We further design a method instructing GPT-4V to +compare two 3D assets according to user-defined criteria. Finally, we use these +pairwise comparison results to assign these models Elo ratings. Experimental +results suggest our metric strongly align with human preference across +different evaluation criteria.",cs.CV,['cs.CV'] +Adversarial Text to Continuous Image Generation,Kilichbek Haydarov · Aashiq Muhamed · Xiaoqian Shen · Jovana Lazarevic · Ivan Skorokhodov · Chamuditha Jayanga Galappaththige · Mohamed Elhoseiny, ,https://arxiv.org/abs/2312.14440,,2312.14440.pdf,Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks,"The widespread use of Text-to-Image (T2I) models in content generation +requires careful examination of their safety, including their robustness to +adversarial attacks. Despite extensive research on adversarial attacks, the +reasons for their effectiveness remain underexplored. This paper presents an +empirical study on adversarial attacks against T2I models, focusing on +analyzing factors associated with attack success rates (ASR). We introduce a +new attack objective - entity swapping using adversarial suffixes and two +gradient-based attack algorithms. Human and automatic evaluations reveal the +asymmetric nature of ASRs on entity swap: for example, it is easier to replace +""human"" with ""robot"" in the prompt ""a human dancing in the rain."" with an +adversarial suffix, but the reverse replacement is significantly harder. We +further propose probing metrics to establish indicative signals from the +model's beliefs to the adversarial ASR. 
We identify conditions that result in a +success probability of 60% for adversarial attacks and others where this +likelihood drops below 5%.",cs.LG,"['cs.LG', 'cs.CR']" +Contextrast: Contextual Contrastive Learning for Semantic Segmentation,Changki Sung · Wanhee Kim · Jungho An · WooJu Lee · Hyungtae Lim · Hyun Myung, ,https://arxiv.org/abs/2404.10633,,2404.10633.pdf,Contextrast: Contextual Contrastive Learning for Semantic Segmentation,"Despite great improvements in semantic segmentation, challenges persist +because of the lack of local/global contexts and the relationship between them. +In this paper, we propose Contextrast, a contrastive learning-based semantic +segmentation method that allows to capture local/global contexts and comprehend +their relationships. Our proposed method comprises two parts: a) contextual +contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. +Contextual contrastive learning obtains local/global context from multi-scale +feature aggregation and inter/intra-relationship of features for better +discrimination capabilities. Meanwhile, BANE sampling selects embedding +features along the boundaries of incorrectly predicted regions to employ them +as harder negative samples on our contrastive learning, resolving segmentation +issues along the boundary region by exploiting fine-grained details. We +demonstrate that our Contextrast substantially enhances the performance of +semantic segmentation networks, outperforming state-of-the-art contrastive +learning approaches on diverse public datasets, e.g. Cityscapes, CamVid, +PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost +during inference.",cs.CV,['cs.CV'] +DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception,Yibo Wang · Ruiyuan Gao · Kai Chen · Kaiqiang Zhou · Yingjie CAI · Lanqing Hong · Zhenguo Li · Lihui Jiang · Dit-Yan Yeung · Qiang Xu · Kai Zhang, ,https://arxiv.org/abs/2403.13304,,2403.13304.pdf,DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception,"Current perceptive models heavily depend on resource-intensive datasets, +prompting the need for innovative solutions. Leveraging recent advances in +diffusion models, synthetic data, by constructing image inputs from various +annotations, proves beneficial for downstream tasks. While prior methods have +separately addressed generative and perceptive models, DetDiffusion, for the +first time, harmonizes both, tackling the challenges in generating effective +data for perceptive models. To enhance image generation with perceptive models, +we introduce perception-aware loss (P.A. loss) through segmentation, improving +both quality and controllability. To boost the performance of specific +perceptive models, our method customizes data augmentation by extracting and +utilizing perception-aware attribute (P.A. Attr) during generation. +Experimental results from the object detection task highlight DetDiffusion's +superior performance, establishing a new state-of-the-art in layout-guided +generation. 
Furthermore, image syntheses from DetDiffusion can effectively +augment training data, significantly enhancing downstream detection +performance.",cs.CV,['cs.CV'] +Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement,Kangmin Xu · Liang Liao · Jing Xiao · Chaofeng Chen · Haoning Wu · Qiong Yan · Weisi Lin, ,https://arxiv.org/abs/2308.12001,,2308.12001.pdf,Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment,"Image Quality Assessment (IQA) constitutes a fundamental task within the +field of computer vision, yet it remains an unresolved challenge, owing to the +intricate distortion conditions, diverse image contents, and limited +availability of data. Recently, the community has witnessed the emergence of +numerous large-scale pretrained foundation models, which greatly benefit from +dramatically increased data and parameter capacities. However, it remains an +open problem whether the scaling law in high-level tasks is also applicable to +IQA task which is closely related to low-level clues. In this paper, we +demonstrate that with proper injection of local distortion features, a larger +pretrained and fixed foundation model performs better in IQA tasks. +Specifically, for the lack of local distortion structure and inductive bias of +vision transformer (ViT), alongside the large-scale pretrained ViT, we use +another pretrained convolution neural network (CNN), which is well known for +capturing the local structure, to extract multi-scale image features. Further, +we propose a local distortion extractor to obtain local distortion features +from the pretrained CNN and a local distortion injector to inject the local +distortion features into ViT. By only training the extractor and injector, our +method can benefit from the rich knowledge in the powerful foundation models +and achieve state-of-the-art performance on popular IQA datasets, indicating +that IQA is not only a low-level problem but also benefits from stronger +high-level features drawn from large-scale pretrained models.",cs.CV,['cs.CV'] +Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples,Junhao Dong · Piotr Koniusz · Junxi Chen · Z. Wang · Yew-Soon Ong, ,,https://www.a-star.edu.sg/cfar/news/news/features/10-papers-accepted-at-cvpr-2024,,,,,nan +Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners,Junhao Dong · Piotr Koniusz · Junxi Chen · Xiaohua Xie · Yew-Soon Ong, ,,https://openreview.net/forum?id=h9TTpQdGKJ,,,,,nan +Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition,Kyle Buettner · Sina Malakouti · Xiang Li · Adriana Kovashka,https://krbuettner.github.io/GeoKnowledgePrompting/,https://arxiv.org/abs/2401.01482,,2401.01482.pdf,Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition,"Existing object recognition models have been shown to lack robustness in +diverse geographical scenarios due to domain shifts in design and context. +Class representations need to be adapted to more accurately reflect an object +concept under these shifts. In the absence of training data from target +geographies, we hypothesize that geographically diverse descriptive knowledge +of categories can enhance robustness. 
For this purpose, we explore the +feasibility of probing a large language model for geography-based object +knowledge, and we examine the effects of integrating knowledge into zero-shot +and learnable soft prompting with CLIP. Within this exploration, we propose +geography knowledge regularization to ensure that soft prompts trained on a +source set of geographies generalize to an unseen target set. Accuracy gains +over prompting baselines on DollarStreet while training only on Europe data are +up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall +on the hardest classes. Competitive performance is shown vs. few-shot target +training, and analysis is provided to direct future study of geographical +robustness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Clustering for Protein Representation Learning,Ruijie Quan · Wenguan Wang · Fan Ma · Hehe Fan · Yi Yang, ,https://arxiv.org/abs/2404.00254,,2404.00254.pdf,Clustering for Protein Representation Learning,"Protein representation learning is a challenging task that aims to capture +the structure and function of proteins from their amino acid sequences. +Previous methods largely ignored the fact that not all amino acids are equally +important for protein folding and activity. In this article, we propose a +neural clustering framework that can automatically discover the critical +components of a protein by considering both its primary and tertiary structure +information. Our framework treats a protein as a graph, where each node +represents an amino acid and each edge represents a spatial or sequential +connection between amino acids. We then apply an iterative clustering strategy +to group the nodes into clusters based on their 1D and 3D positions and assign +scores to each cluster. We select the highest-scoring clusters and use their +medoid nodes for the next iteration of clustering, until we obtain a +hierarchical and informative representation of the protein. We evaluate on four +protein-related tasks: protein fold classification, enzyme reaction +classification, gene ontology term prediction, and enzyme commission number +prediction. Experimental results demonstrate that our method achieves +state-of-the-art performance.",cs.LG,"['cs.LG', 'cs.CE', 'q-bio.BM', 'q-bio.QM']" +Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment,Alvi Md Ishmam · Chris Thomas, ,https://arxiv.org/abs/2402.06659,,2402.06659.pdf,Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models,"Vision-Language Models (VLMs) excel in generating textual responses from +visual inputs, yet their versatility raises significant security concerns. This +study takes the first step in exposing VLMs' susceptibility to data poisoning +attacks that can manipulate responses to innocuous, everyday prompts. We +introduce Shadowcast, a stealthy data poisoning attack method where poison +samples are visually indistinguishable from benign images with matching texts. +Shadowcast demonstrates effectiveness in two attack types. The first is Label +Attack, tricking VLMs into misidentifying class labels, such as confusing +Donald Trump for Joe Biden. The second is Persuasion Attack, which leverages +VLMs' text generation capabilities to craft narratives, such as portraying junk +food as health food, through persuasive and seemingly rational descriptions. We +show that Shadowcast are highly effective in achieving attacker's intentions +using as few as 50 poison samples. 
Moreover, these poison samples remain +effective across various prompts and are transferable across different VLM +architectures in the black-box setting. This work reveals how poisoned VLMs can +generate convincing yet deceptive misinformation and underscores the importance +of data quality for responsible deployments of VLMs. Our code is available at: +https://github.com/umd-huang-lab/VLM-Poisoning.",cs.CR,"['cs.CR', 'cs.AI', 'cs.LG']" +Structured Model Probing: Empowering Efficient Transfer Learning by Structured Regularization,Zhi-Fan Wu · Chaojie Mao · Xue Wang · Jianwen Jiang · Yiliang Lv · Rong Jin, ,https://arxiv.org/abs/2403.10799,,2403.10799.pdf,Efficient Pruning of Large Language Model with Adaptive Estimation Fusion,"Large language models (LLMs) have become crucial for many generative +downstream tasks, leading to an inevitable trend and significant challenge to +deploy them efficiently on resource-constrained devices. Structured pruning is +a widely used method to address this challenge. However, when dealing with the +complex structure of the multiple decoder layers, general methods often employ +common estimation approaches for pruning. These approaches lead to a decline in +accuracy for specific downstream tasks. In this paper, we introduce a simple +yet efficient method that adaptively models the importance of each +substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained +estimations based on the results from complex and multilayer structures. All +aspects of our design seamlessly integrate into the end-to-end pruning +framework. Our experimental results, compared with state-of-the-art methods on +mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, +2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, +respectively.",cs.CL,"['cs.CL', 'cs.AI', 'cs.LG']" +Artist-Friendly Relightable and Animatable Neural Heads,Yingyan Xu · Prashanth Chandran · Sebastian Weiss · Markus Gross · Gaspard Zoss · Derek Bradley,https://studios.disneyresearch.com/2024/06/03/artist-friendly-relightable-and-animatable-neural-heads/,https://arxiv.org/abs/2312.03420,,2312.03420.pdf,Artist-Friendly Relightable and Animatable Neural Heads,"An increasingly common approach for creating photo-realistic digital avatars +is through the use of volumetric neural fields. The original neural radiance +field (NeRF) allowed for impressive novel view synthesis of static heads when +trained on a set of multi-view images, and follow up methods showed that these +neural representations can be extended to dynamic avatars. Recently, new +variants also surpassed the usual drawback of baked-in illumination in neural +representations, showing that static neural avatars can be relit in any +environment. In this work we simultaneously tackle both the motion and +illumination problem, proposing a new method for relightable and animatable +neural heads.
Our method builds on a proven dynamic avatar approach based on a +mixture of volumetric primitives, combined with a recently-proposed lightweight +hardware setup for relightable neural fields, and includes a novel architecture +that allows relighting dynamic neural avatars performing unseen expressions in +any environment, even with nearfield illumination and viewpoints.",cs.CV,"['cs.CV', 'cs.GR']" +Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity,Ruijie Quan · Wenguan Wang · Zhibo Tian · Fan Ma · Yi Yang, ,https://arxiv.org/abs/2403.20022,,2403.20022.pdf,Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity,"Reconstructing the viewed images from human brain activity bridges human and +computer vision through the Brain-Computer Interface. The inherent variability +in brain function between individuals leads existing literature to focus on +acquiring separate models for each individual using their respective brain +signal data, ignoring commonalities between these data. In this article, we +devise Psychometry, an omnifit model for reconstructing images from functional +Magnetic Resonance Imaging (fMRI) obtained from different subjects. Psychometry +incorporates an omni mixture-of-experts (Omni MoE) module where all the experts +work together to capture the inter-subject commonalities, while each expert +associated with subject-specific parameters copes with the individual +differences. Moreover, Psychometry is equipped with a retrieval-enhanced +inference strategy, termed Ecphory, which aims to enhance the learned fMRI +representation via retrieving from prestored subject-specific memories. These +designs collectively render Psychometry omnifit and efficient, enabling it to +capture both inter-subject commonality and individual specificity across +subjects. As a result, the enhanced fMRI representations serve as conditional +signals to guide a generation model to reconstruct high-quality and realistic +images, establishing Psychometry as state-of-the-art in terms of both +high-level and low-level metrics.",cs.CV,['cs.CV'] +JointSQ: Joint Sparsification-Quantization for Distributed Learning,Weiying Xie · Haowei Li · Ma Jitao · Yunsong Li · Jie Lei · donglai Liu · Leyuan Fang, ,,https://www.semanticscholar.org/paper/Joint-Sparsification-and-Quantization-for-Wireless-Su-Wang/f940a77cd570b121a727d59cd249513930cd830a,,,,,nan +PAPR in Motion: Seamless Point-level 3D Scene Interpolation,Shichong Peng · Yanshu Zhang · Ke Li, ,https://arxiv.org/abs/2307.11086,,2307.11086.pdf,PAPR: Proximity Attention Point Rendering,"Learning accurate and parsimonious point cloud representations of scene +surfaces from scratch remains a challenge in 3D representation learning. +Existing point-based methods often suffer from the vanishing gradient problem +or require a large number of points to accurately model scene geometry and +texture. To address these limitations, we propose Proximity Attention Point +Rendering (PAPR), a novel method that consists of a point-based scene +representation and a differentiable renderer. Our scene representation uses a +point cloud where each point is characterized by its spatial position, +influence score, and view-independent feature vector. The renderer selects the +relevant points for each ray and produces accurate colours using their +associated features. PAPR effectively learns point cloud positions to represent +the correct scene geometry, even when the initialization drastically differs +from the target geometry. 
Notably, our method captures fine texture details +while using only a parsimonious set of points. We also demonstrate four +practical applications of our method: zero-shot geometry editing, object +manipulation, texture transfer, and exposure control. More results and code are +available on our project website at https://zvict.github.io/papr/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG', 'cs.NE']" +Anatomically Constrained Implicit Face Models,Prashanth Chandran · Gaspard Zoss, ,https://arxiv.org/abs/2312.07538,,2312.07538.pdf,Anatomically Constrained Implicit Face Models,"Coordinate based implicit neural representations have gained rapid popularity +in recent years as they have been successfully used in image, geometry and +scene modeling tasks. In this work, we present a novel use case for such +implicit representations in the context of learning anatomically constrained +face models. Actor specific anatomically constrained face models are the state +of the art in both facial performance capture and performance retargeting. +Despite their practical success, these anatomical models are slow to evaluate +and often require extensive data capture to be built. We propose the anatomical +implicit face model; an ensemble of implicit neural networks that jointly learn +to model the facial anatomy and the skin surface with high-fidelity, and can +readily be used as a drop in replacement to conventional blendshape models. +Given an arbitrary set of skin surface meshes of an actor and only a neutral +shape with estimated skull and jaw bones, our method can recover a dense +anatomical substructure which constrains every point on the facial surface. We +demonstrate the usefulness of our approach in several tasks ranging from shape +fitting, shape editing, and performance retargeting.",cs.GR,"['cs.GR', 'cs.CV']" +EscherNet: A Generative Model for Scalable View Synthesis,Xin Kong · Shikun Liu · Xiaoyang Lyu · Marwan Taher · Xiaojuan Qi · Andrew J. Davison,https://kxhit.github.io/EscherNet,https://arxiv.org/abs/2402.03908,,2402.03908.pdf,EscherNet: A Generative Model for Scalable View Synthesis,"We introduce EscherNet, a multi-view conditioned diffusion model for view +synthesis. EscherNet learns implicit and generative 3D representations coupled +with a specialised camera positional encoding, allowing precise and continuous +relative control of the camera transformation between an arbitrary number of +reference and target views. EscherNet offers exceptional generality, +flexibility, and scalability in view synthesis -- it can generate more than 100 +consistent target views simultaneously on a single consumer-grade GPU, despite +being trained with a fixed number of 3 reference views to 3 target views. As a +result, EscherNet not only addresses zero-shot novel view synthesis, but also +naturally unifies single- and multi-image 3D reconstruction, combining these +diverse tasks into a single, cohesive framework. Our extensive experiments +demonstrate that EscherNet achieves state-of-the-art performance in multiple +benchmarks, even when compared to methods specifically tailored for each +individual problem. This remarkable versatility opens up new directions for +designing scalable neural architectures for 3D vision. 
Project page: +https://kxhit.github.io/EscherNet.",cs.CV,['cs.CV'] +Revisiting Adversarial Training under Long-Tailed Distributions,Xinli Yue · Ningping Mou · Qian Wang · Lingchen Zhao,https://github.com/NISPLab/AT-BSL,https://arxiv.org/abs/2403.10073,,2403.10073.pdf,Revisiting Adversarial Training under Long-Tailed Distributions,"Deep neural networks are vulnerable to adversarial attacks, often leading to +erroneous outputs. Adversarial training has been recognized as one of the most +effective methods to counter such attacks. However, existing adversarial +training techniques have predominantly been tested on balanced datasets, +whereas real-world data often exhibit a long-tailed distribution, casting doubt +on the efficacy of these methods in practical scenarios. + In this paper, we delve into adversarial training under long-tailed +distributions. Through an analysis of the previous work ""RoBal"", we discover +that utilizing Balanced Softmax Loss alone can achieve performance comparable +to the complete RoBal approach while significantly reducing training overheads. +Additionally, we reveal that, similar to uniform distributions, adversarial +training under long-tailed distributions also suffers from robust overfitting. +To address this, we explore data augmentation as a solution and unexpectedly +discover that, unlike results obtained with balanced data, data augmentation +not only effectively alleviates robust overfitting but also significantly +improves robustness. We further investigate the reasons behind the improvement +of robustness through data augmentation and identify that it is attributable to +the increased diversity of examples. Extensive experiments further corroborate +that data augmentation alone can significantly improve robustness. Finally, +building on these findings, we demonstrate that compared to RoBal, the +combination of BSL and data augmentation leads to a +6.66% improvement in model +robustness under AutoAttack on CIFAR-10-LT. Our code is available at +https://github.com/NISPLab/AT-BSL .",cs.CV,['cs.CV'] +UniGS: Unified Representation for Image Generation and Segmentation,Lu Qi · Lehan Yang · Weidong Guo · Yu Xu · Bo Du · Varun Jampani · Ming-Hsuan Yang, ,https://arxiv.org/abs/2312.01985,,2312.01985.pdf,UniGS: Unified Representation for Image Generation and Segmentation,"This paper introduces a novel unified representation of diffusion models for +image generation and segmentation. Specifically, we use a colormap to represent +entity-level masks, addressing the challenge of varying entity numbers while +aligning the representation closely with the image RGB domain. Two novel +modules, including the location-aware color palette and progressive dichotomy +module, are proposed to support our mask representation. On the one hand, a +location-aware palette guarantees the colors' consistency to entities' +locations. On the other hand, the progressive dichotomy module can efficiently +decode the synthesized colormap to high-quality entity-level masks in a +depth-first binary search without knowing the cluster numbers. To tackle the +issue of lacking large-scale segmentation training data, we employ an +inpainting pipeline and then improve the flexibility of diffusion models across +various tasks, including inpainting, image synthesis, referring segmentation, +and entity segmentation. Comprehensive experiments validate the efficiency of +our approach, demonstrating comparable segmentation mask quality to +state-of-the-art and adaptability to multiple tasks. 
The code will be released +at \href{https://github.com/qqlu/Entity}{https://github.com/qqlu/Entity}.",cs.CV,['cs.CV'] +Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation,guo · Tianwei Lin, ,https://arxiv.org/abs/2312.10113,,2312.10113.pdf,Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation,"Recently, diffusion-based methods, like InstructPix2Pix (IP2P), have achieved +effective instruction-based image editing, requiring only natural language +instructions from the user. However, these methods often inadvertently alter +unintended areas and struggle with multi-instruction editing, resulting in +compromised outcomes. To address these issues, we introduce the Focus on Your +Instruction (FoI), a method designed to ensure precise and harmonious editing +across multiple instructions without extra training or test-time optimization. +In the FoI, we primarily emphasize two aspects: (1) precisely extracting +regions of interest for each instruction and (2) guiding the denoising process +to concentrate within these regions of interest. For the first objective, we +identify the implicit grounding capability of IP2P from the cross-attention +between instruction and image, then develop an effective mask extraction +method. For the second objective, we introduce a cross attention modulation +module for rough isolation of target editing regions and unrelated regions. +Additionally, we introduce a mask-guided disentangle sampling strategy to +further ensure clear region isolation. Experimental results demonstrate that +FoI surpasses existing methods in both quantitative and qualitative +evaluations, especially excelling in multi-instruction editing task.",cs.CV,['cs.CV'] +MorpheuS: Neural Dynamic 360$^{\circ}$ Surface Reconstruction from Monocular RGB-D Video,Hengyi Wang · Jingwen Wang · Lourdes Agapito,https://hengyiwang.github.io/projects/morpheus.html,https://arxiv.org/abs/2312.00778,,2312.00778.pdf,MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video,"Neural rendering has demonstrated remarkable success in dynamic scene +reconstruction. Thanks to the expressiveness of neural representations, prior +works can accurately capture the motion and achieve high-fidelity +reconstruction of the target object. Despite this, real-world video scenarios +often feature large unobserved regions where neural representations struggle to +achieve realistic completion. To tackle this challenge, we introduce MorpheuS, +a framework for dynamic 360{\deg} surface reconstruction from a casually +captured RGB-D video. Our approach models the target scene as a canonical field +that encodes its geometry and appearance, in conjunction with a deformation +field that warps points from the current frame to the canonical space. We +leverage a view-dependent diffusion prior and distill knowledge from it to +achieve realistic completion of unobserved regions. 
Experimental results on +various real-world and synthetic datasets show that our method can achieve +high-fidelity 360{\deg} surface reconstruction of a deformable object from a +monocular RGB-D video.",cs.CV,['cs.CV'] +DiffusionLight: Light Probes for Free by Painting a Chrome Ball,Pakkapon Phongthawee · Worameth Chinchuthakun · Nontaphat Sinsunthithet · Varun Jampani · Amit Raj · Pramook Khungurn · Supasorn Suwajanakorn,https://diffusionlight.github.io/,https://arxiv.org/abs/2312.09168v2,,2312.09168v2.pdf,DiffusionLight: Light Probes for Free by Painting a Chrome Ball,"We present a simple yet effective technique to estimate lighting in a single +input image. Current techniques rely heavily on HDR panorama datasets to train +neural networks to regress an input with limited field-of-view to a full +environment map. However, these approaches often struggle with real-world, +uncontrolled settings due to the limited diversity and size of their datasets. +To address this problem, we leverage diffusion models trained on billions of +standard images to render a chrome ball into the input image. Despite its +simplicity, this task remains challenging: the diffusion models often insert +incorrect or inconsistent objects and cannot readily generate images in HDR +format. Our research uncovers a surprising relationship between the appearance +of chrome balls and the initial diffusion noise map, which we utilize to +consistently generate high-quality chrome balls. We further fine-tune an LDR +diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure +bracketing for HDR light estimation. Our method produces convincing light +estimates across diverse settings and demonstrates superior generalization to +in-the-wild scenarios.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', 'I.3.3; I.4.8']" +JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments,Duy Tho Le · Chenhui Gou · Stavya Datta · Hengcan Shi · Ian Reid · Jianfei Cai · Hamid Rezatofighi, ,https://arxiv.org/abs/2404.01686,,2404.01686.pdf,JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments,"Autonomous robot systems have attracted increasing research attention in +recent years, where environment understanding is a crucial step for robot +navigation, human-robot interaction, and decision. Real-world robot systems +usually collect visual data from multiple sensors and are required to recognize +numerous objects and their movements in complex human-crowded settings. +Traditional benchmarks, with their reliance on single sensors and limited +object classes and scenarios, fail to provide the comprehensive environmental +understanding robots need for accurate navigation, interaction, and +decision-making. As an extension of JRDB dataset, we unveil JRDB-PanoTrack, a +novel open-world panoptic segmentation and tracking benchmark, towards more +comprehensive environmental perception. JRDB-PanoTrack includes (1) various +data involving indoor and outdoor crowded scenes, as well as comprehensive 2D +and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic +segmentation and temporal tracking annotations, with additional 3D label +projections for further spatial understanding; (3) diverse object classes for +closed- and open-world recognition benchmarks, with OSPA-based metrics for +evaluation.
Extensive evaluation of leading methods shows significant +challenges posed by our dataset.",cs.CV,['cs.CV'] +Adaptive VIO: Deep Visual-Inertial Odometry with Online Continual Learning,Youqi Pan · Wugen Zhou · Yingdian Cao · Hongbin Zha, ,https://arxiv.org/html/2308.11228v2,,2308.11228v2.pdf,VIO-DualProNet: Visual-Inertial Odometry with Learning Based Process Noise Covariance,"Visual-inertial odometry (VIO) is a vital technique used in robotics, +augmented reality, and autonomous vehicles. It combines visual and inertial +measurements to accurately estimate position and orientation. Existing VIO +methods assume a fixed noise covariance for the inertial uncertainty. However, +accurately determining in real-time the noise variance of the inertial sensors +presents a significant challenge as the uncertainty changes throughout the +operation leading to suboptimal performance and reduced accuracy. To circumvent +this, we propose VIO-DualProNet, a novel approach that utilizes deep learning +methods to dynamically estimate the inertial noise uncertainty in real-time. By +designing and training a deep neural network to predict inertial noise +uncertainty using only inertial sensor measurements, and integrating it into +the VINS-Mono algorithm, we demonstrate a substantial improvement in accuracy +and robustness, enhancing VIO performance and potentially benefiting other +VIO-based systems for precise localization and mapping across diverse +conditions.",cs.RO,"['cs.RO', 'cs.SY', 'eess.SY']" +ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images,Yiqi Shi · Duo Liu · Liguo Zhang · Ye Tian · Xuezhi Xia · fuxiaojing,https://github.com/Doyle59217/ZeroIG,https://arxiv.org/abs/2311.02995,,2311.02995.pdf,Zero-Shot Enhancement of Low-Light Image Based on Retinex Decomposition,"Two difficulties here make low-light image enhancement a challenging task; +firstly, it needs to consider not only luminance restoration but also image +contrast, image denoising and color distortion issues simultaneously. Second, +the effectiveness of existing low-light enhancement methods depends on paired +or unpaired training data with poor generalization performance. + To solve these difficult problems, we propose in this paper a new +learning-based Retinex decomposition of zero-shot low-light enhancement method, +called ZERRINNet. To this end, we first designed the N-Net network, together +with the noise loss term, to be used for denoising the original low-light image +by estimating the noise of the low-light image. Moreover, RI-Net is used to +estimate the reflection component and illumination component, and in order to +solve the color distortion and contrast, we use the texture loss term and +segmented smoothing loss to constrain the reflection component and illumination +component. Finally, our method is a zero-reference enhancement method that is +not affected by the training data of paired and unpaired datasets, so our +generalization performance is greatly improved, and in the paper, we have +effectively validated it with a homemade real-life low-light dataset and +additionally with advanced vision tasks, such as face detection, target +recognition, and instance segmentation. We conducted comparative experiments on +a large number of public datasets and the results show that the performance of +our method is competitive compared to the current state-of-the-art methods. 
The +code is available at:https://github.com/liwenchao0615/ZERRINNet",cs.CV,"['cs.CV', 'cs.GR']" +Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models,Gianni Franchi · Olivier Laurent · Maxence Leguéry · Andrei Bursuc · Andrea Pilzer · Angela Yao,https://ensta-u2is-ai.github.io/ABNN-Make-me-a-BNN/,https://arxiv.org/abs/2312.15297,,2312.15297.pdf,Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models,"Deep Neural Networks (DNNs) are powerful tools for various computer vision +tasks, yet they often struggle with reliable uncertainty quantification - a +critical requirement for real-world applications. Bayesian Neural Networks +(BNN) are equipped for uncertainty estimation but cannot scale to large DNNs +that are highly unstable to train. To address this challenge, we introduce the +Adaptable Bayesian Neural Network (ABNN), a simple and scalable strategy to +seamlessly transform DNNs into BNNs in a post-hoc manner with minimal +computational and training overheads. ABNN preserves the main predictive +properties of DNNs while enhancing their uncertainty quantification abilities +through simple BNN adaptation layers (attached to normalization layers) and a +few fine-tuning steps on pre-trained models. We conduct extensive experiments +across multiple datasets for image classification and semantic segmentation +tasks, and our results demonstrate that ABNN achieves state-of-the-art +performance without the computational budget typically associated with ensemble +methods.",cs.LG,"['cs.LG', 'cs.CV', 'stat.ML']" +OpenBias: Open-set Bias Detection in Text-to-Image Generative Models,Moreno D'Incà · Elia Peruzzo · Massimiliano Mancini · Dejia Xu · Vidit Goel · Xingqian Xu · Zhangyang Wang · Humphrey Shi · Nicu Sebe,https://github.com/Picsart-AI-Research/OpenBias,https://arxiv.org/abs/2404.07990v1,,2404.07990v1.pdf,OpenBias: Open-set Bias Detection in Text-to-Image Generative Models,"Text-to-image generative models are becoming increasingly popular and +accessible to the general public. As these models see large-scale deployments, +it is necessary to deeply investigate their safety and fairness to not +disseminate and perpetuate any kind of biases. However, existing works focus on +detecting closed sets of biases defined a priori, limiting the studies to +well-known concepts. In this paper, we tackle the challenge of open-set bias +detection in text-to-image generative models presenting OpenBias, a new +pipeline that identifies and quantifies the severity of biases agnostically, +without access to any precompiled set. OpenBias has three stages. In the first +phase, we leverage a Large Language Model (LLM) to propose biases given a set +of captions. Secondly, the target generative model produces images using the +same set of captions. Lastly, a Vision Question Answering model recognizes the +presence and extent of the previously proposed biases. We study the behavior of +Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated +before. 
Via quantitative experiments, we demonstrate that OpenBias agrees with +current closed-set bias detection methods and human judgement.",cs.CV,"['cs.CV', 'cs.AI']" +Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes,Liqiong Wang · Jinyu Yang · Yanfu Zhang · Fangyi Wang · Feng Zheng,https://github.com/Kki2Eve/RISNet,,https://www.mdpi.com/1424-8220/24/6/1942,,,,,nan +GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement,Linfang Zheng · Tze Ho Elden Tse · Chen Wang · Yinghan Sun · Hua Chen · Aleš Leonardis · Wei Zhang · Hyung Jin Chang,https://lynne-zheng-linfang.github.io/georef.github.io/,https://arxiv.org/abs/2404.11139v1,,2404.11139v1.pdf,GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement,"Object pose refinement is essential for robust object pose estimation. +Previous work has made significant progress towards instance-level object pose +refinement. Yet, category-level pose refinement is a more challenging problem +due to large shape variations within a category and the discrepancies between +the target object and the shape prior. To address these challenges, we +introduce a novel architecture for category-level object pose refinement. Our +approach integrates an HS-layer and learnable affine transformations, which +aims to enhance the extraction and alignment of geometric information. +Additionally, we introduce a cross-cloud transformation mechanism that +efficiently merges diverse data sources. Finally, we push the limits of our +model by incorporating the shape prior information for translation and size +error prediction. We conducted extensive experiments to demonstrate the +effectiveness of the proposed framework. Through extensive quantitative +experiments, we demonstrate significant improvement over the baseline method by +a large margin across all metrics.",cs.CV,['cs.CV'] +Learning to Control Camera Exposure via Reinforcement Learning,Kyunghyun Lee · Ukcheol Shin · Byeong-Uk Lee,https://sites.google.com/view/drl-ae,https://arxiv.org/abs/2404.01636,,2404.01636.pdf,Learning to Control Camera Exposure via Reinforcement Learning,"Adjusting camera exposure in arbitrary lighting conditions is the first step +to ensure the functionality of computer vision applications. Poorly adjusted +camera exposure often leads to critical failure and performance degradation. +Traditional camera exposure control methods require multiple convergence steps +and time-consuming processes, making them unsuitable for dynamic lighting +conditions. In this paper, we propose a new camera exposure control framework +that rapidly controls camera exposure while performing real-time processing by +exploiting deep reinforcement learning. The proposed framework consists of four +contributions: 1) a simplified training ground to simulate real-world's diverse +and dynamic lighting changes, 2) flickering and image attribute-aware reward +design, along with lightweight state design for real-time processing, 3) a +static-to-dynamic lighting curriculum to gradually improve the agent's +exposure-adjusting capability, and 4) domain randomization techniques to +alleviate the limitation of the training ground and achieve seamless +generalization in the wild.As a result, our proposed method rapidly reaches a +desired exposure level within five steps with real-time processing (1 ms). 
+Also, the acquired images are well-exposed and show superiority in various +computer vision tasks, such as feature extraction and object detection.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO', 'cs.SY', 'eess.SY']" +Differentiable Neural Surface Refinement for Transparent Objects,Weijian Deng · Dylan Campbell · Chunyi Sun · Shubham Kanitkar · Matthew Shaffer · Stephen Gould,https://weijiandeng.xyz/nsr,,https://dl.acm.org/doi/abs/10.1145/3610548.3618236,,,,,nan +Discovering and Mitigating Visual Biases through Keyword Explanation,Younghyun Kim · Sangwoo Mo · Minkyu Kim · Kyungmin Lee · Jaeho Lee · Jinwoo Shin, ,,https://effl.postech.ac.kr/docs/research/papers/,,,,,nan +MiKASA: Multi-Key-Anchor Scene-Aware Transformer for 3D Visual Grounding,Chun-Peng Chang · Shaoxiang Wang · Alain Pagani · Didier Stricker, ,https://arxiv.org/abs/2403.03077,,,MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding,"3D visual grounding involves matching natural language descriptions with +their corresponding objects in 3D spaces. Existing methods often face +challenges with accuracy in object recognition and struggle in interpreting +complex linguistic queries, particularly with descriptions that involve +multiple anchors or are view-dependent. In response, we present the MiKASA +(Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model +integrates a self-attention-based scene-aware object encoder and an original +multi-key-anchor technique, enhancing object recognition accuracy and the +understanding of spatial relationships. Furthermore, MiKASA improves the +explainability of decision-making, facilitating error diagnosis. Our model +achieves the highest overall accuracy in the Referit3D challenge for both the +Sr3D and Nr3D datasets, particularly excelling by a large margin in categories +that require viewpoint-dependent descriptions.",cs.CV,['cs.CV'] +Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3),Tsu-Ching Hsiao · Hao-Wei Chen · Hsuan-Kung Yang · Chun-Yi Lee, ,https://arxiv.org/abs/2401.00029,,,6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,"Estimating the 6D object pose from a single RGB image often involves noise +and indeterminacy due to challenges such as occlusions and cluttered +backgrounds. Meanwhile, diffusion models have shown appealing performance in +generating high-quality images from random noise with high indeterminacy +through step-by-step denoising. Inspired by their denoising capability, we +propose a novel diffusion-based framework (6D-Diff) to handle the noise and +indeterminacy in object pose estimation for better performance. In our +framework, to establish accurate 2D-3D correspondence, we formulate 2D +keypoints detection as a reverse diffusion (denoising) process. To facilitate +such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion +process and condition the reverse process on the object features. Extensive +experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our +framework.",cs.CV,['cs.CV'] +Text-guided Explorable Image Super-resolution,Kanchana Vaishnavi Gandikota · Paramanand Chandramouli, ,https://arxiv.org/abs/2403.01124,,2403.01124.pdf,Text-guided Explorable Image Super-resolution,"In this paper, we introduce the problem of zero-shot text-guided exploration +of the solutions to open-domain image super-resolution. 
Our goal is to allow +users to explore diverse, semantically accurate reconstructions that preserve +data consistency with the low-resolution inputs for different large +downsampling factors without explicitly training for these specific +degradations. We propose two approaches for zero-shot text-guided +super-resolution - i) modifying the generative process of text-to-image +\textit{T2I} diffusion models to promote consistency with low-resolution +inputs, and ii) incorporating language guidance into zero-shot diffusion-based +restoration methods. We show that the proposed approaches result in diverse +solutions that match the semantic meaning provided by the text prompt while +preserving data consistency with the degraded inputs. We evaluate the proposed +baselines for the task of extreme super-resolution and demonstrate advantages +in terms of restoration quality, diversity, and explorability of solutions.",cs.CV,['cs.CV'] +$CrowdDiff$: Multi-hypothesis Crowd Density Estimation using Diffusion Models,Yasiru Ranasinghe · Nithin Gopalakrishnan Nair · Wele Gedara Chaminda Bandara · Vishal M. Patel, ,,https://jarxiv.com/2024/04/05/crowddiff-multi-hypothesis-crowd-density-estimation-using-diffusion-models/,,,,,nan +Instruct-Imagen: Image Generation with Multi-modal Instruction,Hexiang Hu · Kelvin C.K. Chan · Yu-Chuan Su · Wenhu Chen · Yandong Li · Kihyuk Sohn · Yang Zhao · Xue Ben · William Cohen · Ming-Wei Chang · Xuhui Jia,https://instruct-imagen.github.io/,https://arxiv.org/abs/2401.01952,,2401.01952.pdf,Instruct-Imagen: Image Generation with Multi-modal Instruction,"This paper presents instruct-imagen, a model that tackles heterogeneous image +generation tasks and generalizes across unseen tasks. We introduce *multi-modal +instruction* for image generation, a task representation articulating a range +of generation intents with precision. It uses natural language to amalgamate +disparate modalities (e.g., text, edge, style, subject, etc.), such that +abundant generation intents can be standardized in a uniform format. + We then build instruct-imagen by fine-tuning a pre-trained text-to-image +diffusion model with a two-stage framework. First, we adapt the model using the +retrieval-augmented training, to enhance model's capabilities to ground its +generation on external multimodal context. Subsequently, we fine-tune the +adapted model on diverse image generation tasks that requires vision-language +understanding (e.g., subject-driven generation, etc.), each paired with a +multi-modal instruction encapsulating the task's essence. Human evaluation on +various image generation datasets reveals that instruct-imagen matches or +surpasses prior task-specific models in-domain and demonstrates promising +generalization to unseen and more complex tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +$MonoDiff$: Monocular 3D Object Detection and Pose Estimation with Diffusion Models,Yasiru Ranasinghe · Deepti Hegde · Vishal M. Patel, ,https://arxiv.org/abs/2403.18791,,,Object Pose Estimation via the Aggregation of Diffusion Features,"Estimating the pose of objects from images is a crucial task of 3D scene +understanding, and recent approaches have shown promising results on very large +benchmarks. However, these methods experience a significant performance drop +when dealing with unseen objects. We believe that it results from the limited +generalizability of image features. To address this problem, we have an +in-depth analysis on the features of diffusion models, e.g. 
Stable Diffusion, +which hold substantial potential for modeling unseen objects. Based on this +analysis, we then innovatively introduce these diffusion features for object +pose estimation. To achieve this, we propose three distinct architectures that +can effectively capture and aggregate diffusion features of different +granularity, greatly improving the generalizability of object pose estimation. +Our approach outperforms the state-of-the-art methods by a considerable margin +on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our +method achieves higher accuracy than the previous best arts on unseen objects: +98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the +strong generalizability of our method. Our code is released at +https://github.com/Tianfu18/diff-feats-pose.",cs.CV,['cs.CV'] +Towards More Unified In-context Visual Understanding,Dianmo Sheng · Dongdong Chen · Zhentao Tan · Qiankun Liu · Qi Chu · Jianmin Bao · Tao Gong · Bin Liu · Shengwei Xu · Nenghai Yu, ,https://arxiv.org/abs/2312.02520v2,,2312.02520v2.pdf,Towards More Unified In-context Visual Understanding,"The rapid advancement of large language models (LLMs) has accelerated the +emergence of in-context learning (ICL) as a cutting-edge approach in the +natural language processing domain. Recently, ICL has been employed in visual +understanding tasks, such as semantic segmentation and image captioning, +yielding promising results. However, existing visual ICL framework can not +enable producing content across multiple modalities, which limits their +potential usage scenarios. To address this issue, we present a new ICL +framework for visual understanding with multi-modal output enabled. First, we +quantize and embed both text and visual prompt into a unified representational +space, structured as interleaved in-context sequences. Then a decoder-only +sparse transformer architecture is employed to perform generative modeling on +them, facilitating in-context learning. Thanks to this design, the model is +capable of handling in-context vision understanding tasks with multimodal +output in a unified pipeline.Experimental results demonstrate that our model +achieves competitive performance compared with specialized models and previous +ICL baselines. Overall, our research takes a further step toward unified +multimodal in-context learning.",cs.CV,['cs.CV'] +Compositional Chain-of-Thought Prompting for Large Multimodal Models,Chancharik Mitra · Brandon Huang · Trevor Darrell · Roei Herzig, ,https://arxiv.org/abs/2311.17076,,2311.17076.pdf,Compositional Chain-of-Thought Prompting for Large Multimodal Models,"The combination of strong visual backbones and Large Language Model (LLM) +reasoning has led to Large Multimodal Models (LMMs) becoming the current +standard for a wide range of vision and language (VL) tasks. However, recent +research has shown that even the most advanced LMMs still struggle to capture +aspects of compositional visual reasoning, such as attributes and relationships +between objects. One solution is to utilize scene graphs (SGs)--a formalization +of objects and their relations and attributes that has been extensively used as +a bridge between the visual and textual domains. Yet, scene graph data requires +scene graph annotations, which are expensive to collect and thus not easily +scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic +forgetting of the pretraining objective. 
To overcome this, inspired by +chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a +novel zero-shot Chain-of-Thought prompting method that utilizes SG +representations in order to extract compositional knowledge from an LMM. +Specifically, we first generate an SG using the LMM, and then use that SG in +the prompt to produce a response. Through extensive experiments, we find that +the proposed CCoT approach not only improves LMM performance on several vision +and language VL compositional benchmarks but also improves the performance of +several popular LMMs on general multimodal benchmarks, without the need for +fine-tuning or annotated ground-truth SGs. Code: +https://github.com/chancharikmitra/CCoT",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +"CAMixerSR: Only Details Need More ""Attention""",Yan Wang · Yi Liu · Shijie Zhao · Junlin Li · Li zhang,https://github.com/icandle/CAMixerSR,https://arxiv.org/abs/2402.19289v2,,2402.19289v2.pdf,"CAMixerSR: Only Details Need More ""Attention""","To satisfy the rapidly increasing demands on the large image (2K-8K) +super-resolution (SR), prevailing methods follow two independent tracks: 1) +accelerate existing networks by content-aware routing, and 2) design better +super-resolution networks via token mixer refining. Despite directness, they +encounter unavoidable defects (e.g., inflexible route or non-discriminative +processing) limiting further improvements of quality-complexity trade-off. To +erase the drawbacks, we integrate these schemes by proposing a content-aware +mixer (CAMixer), which assigns convolution for simple contexts and additional +deformable window-attention for sparse textures. Specifically, the CAMixer uses +a learnable predictor to generate multiple bootstraps, including offsets for +windows warping, a mask for classifying windows, and convolutional attentions +for endowing convolution with the dynamic property, which modulates attention +to include more useful textures self-adaptively and improves the representation +capability of convolution. We further introduce a global classification loss to +improve the accuracy of predictors. By simply stacking CAMixers, we obtain +CAMixerSR which achieves superior performance on large-image SR, lightweight +SR, and omnidirectional-image SR.",eess.IV,"['eess.IV', 'cs.CV']" +Geometrically-informed aggregation for zero-shot point cloud understanding,Guofeng Mei · Luigi Riz · Yiming Wang · Fabio Poiesi, ,https://arxiv.org/abs/2312.02244,,2312.02244.pdf,Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding,"Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language +Models (VLMs). Existing strategies directly map Vision-Language Models from 2D +pixels of rendered or captured views to 3D points, overlooking the inherent and +expressible point cloud geometric structure. Geometrically similar or close +regions can be exploited for bolstering point cloud understanding as they are +likely to share semantic information. To this end, we introduce the first +training-free aggregation technique that leverages the point cloud's 3D +geometric structure to improve the quality of the transferred Vision-Language +Models. Our approach operates iteratively, performing local-to-global +aggregation based on geometric and semantic point-level reasoning. 
We benchmark +our approach on three downstream tasks, including classification, part +segmentation, and semantic segmentation, with a variety of datasets +representing both synthetic/real-world, and indoor/outdoor scenarios. Our +approach achieves new state-of-the-art results in all benchmarks. Our approach +operates iteratively, performing local-to-global aggregation based on geometric +and semantic point-level reasoning. Code and dataset are available at +https://luigiriz.github.io/geoze-website/",cs.CV,['cs.CV'] +CrossKD: Cross-Head Knowledge Distillation for Dense Object Detection,JiaBao Wang · yuming chen · Zhaohui Zheng · Xiang Li · Ming-Ming Cheng · Qibin Hou,https://github.com/jbwang1997/CrossKD,https://arxiv.org/abs/2306.11369,,2306.11369.pdf,CrossKD: Cross-Head Knowledge Distillation for Object Detection,"Knowledge Distillation (KD) has been validated as an effective model +compression technique for learning compact object detectors. Existing +state-of-the-art KD methods for object detection are mostly based on feature +imitation. In this paper, we present a general and effective prediction +mimicking distillation scheme, called CrossKD, which delivers the intermediate +features of the student's detection head to the teacher's detection head. The +resulting cross-head predictions are then forced to mimic the teacher's +predictions. This manner relieves the student's head from receiving +contradictory supervision signals from the annotations and the teacher's +predictions, greatly improving the student's detection performance. Moreover, +as mimicking the teacher's predictions is the target of KD, CrossKD offers more +task-oriented information in contrast with feature imitation. On MS COCO, with +only prediction mimicking losses applied, our CrossKD boosts the average +precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, +outperforming all existing KD methods. In addition, our method also works well +when distilling detectors with heterogeneous backbones. Code is available at +https://github.com/jbwang1997/CrossKD.",cs.CV,['cs.CV'] +DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos,Arjun Balasingam · Joseph Chandler · Chenning Li · Zhoutong Zhang · Hari Balakrishnan,https://drivetrack.csail.mit.edu/,https://arxiv.org/abs/2312.09523,,2312.09523.pdf,DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos,"This paper presents DriveTrack, a new benchmark and data generation framework +for long-range keypoint tracking in real-world videos. DriveTrack is motivated +by the observation that the accuracy of state-of-the-art trackers depends +strongly on visual attributes around the selected keypoints, such as texture +and lighting. The problem is that these artifacts are especially pronounced in +real-world videos, but these trackers are unable to train on such scenes due to +a dearth of annotations. DriveTrack bridges this gap by building a framework to +automatically annotate point tracks on autonomous driving datasets. We release +a dataset consisting of 1 billion point tracks across 24 hours of video, which +is seven orders of magnitude greater than prior real-world benchmarks and on +par with the scale of synthetic benchmarks. DriveTrack unlocks new use cases +for point tracking in real-world videos. First, we show that fine-tuning +keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to +7%. 
Second, we analyze the sensitivity of trackers to visual artifacts in real +scenes and motivate the idea of running assistive keypoint selectors alongside +trackers.",cs.CV,['cs.CV'] +CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model,Jianhao Zeng · Dan Song · Weizhi Nie · Hongshuo Tian · Tongtong Wang · An-An Liu,https://zengjianhao.github.io/CAT-DM,https://arxiv.org/abs/2311.18405,,2311.18405.pdf,CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model,"Generative Adversarial Networks (GANs) dominate the research field in +image-based virtual try-on, but have not resolved problems such as unnatural +deformation of garments and the blurry generation quality. While the generative +quality of diffusion models is impressive, achieving controllability poses a +significant challenge when applying it to virtual try-on and multiple denoising +iterations limit its potential for real-time applications. In this paper, we +propose Controllable Accelerated virtual Try-on with Diffusion Model (CAT-DM). +To enhance the controllability, a basic diffusion-based virtual try-on network +is designed, which utilizes ControlNet to introduce additional control +conditions and improves the feature extraction of garment images. In terms of +acceleration, CAT-DM initiates a reverse denoising process with an implicit +distribution generated by a pre-trained GAN-based model. Compared with previous +try-on methods based on diffusion models, CAT-DM not only retains the pattern +and texture details of the in-shop garment but also reduces the sampling steps +without compromising generation quality. Extensive experiments demonstrate the +superiority of CAT-DM against both GAN-based and diffusion-based methods in +producing more realistic images and accurately reproducing garment patterns.",cs.CV,['cs.CV'] +Free3D: Consistent Novel View Synthesis without 3D Representation,Chuanxia Zheng · Andrea Vedaldi,https://chuanxiaz.com/free3d/,https://arxiv.org/abs/2312.04551,,2312.04551.pdf,Free3D: Consistent Novel View Synthesis without 3D Representation,"We introduce Free3D, a simple accurate method for monocular open-set novel +view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D +image generator for generalization, and fine-tune it for NVS. Compared to other +works that took a similar approach, we obtain significant improvements without +resorting to an explicit 3D representation, which is slow and memory-consuming, +and without training an additional network for 3D reconstruction. Our key +contribution is to improve the way the target camera pose is encoded in the +network, which we do by introducing a new ray conditioning normalization (RCN) +layer. The latter injects pose information in the underlying 2D image generator +by telling each pixel its viewing direction. We further improve multi-view +consistency by using light-weight multi-view attention layers and by sharing +generation noise between the different views. We train Free3D on the Objaverse +dataset and demonstrate excellent generalization to new categories in new +datasets, including OmniObject3D and GSO.
The project page is available at +https://chuanxiaz.com/free3d/.",cs.CV,['cs.CV'] +InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields,Dongqing Wang · Tong Zhang · Alaa Abboud · Sabine Süsstrunk, ,https://arxiv.org/html/2401.05335v1,,2401.05335v1.pdf,InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes,"We introduce InseRF, a novel method for generative object insertion in the +NeRF reconstructions of 3D scenes. Based on a user-provided textual description +and a 2D bounding box in a reference viewpoint, InseRF generates new objects in +3D scenes. Recently, methods for 3D scene editing have been profoundly +transformed, owing to the use of strong priors of text-to-image diffusion +models in 3D generative modeling. Existing methods are mostly effective in +editing 3D scenes via style and appearance changes or removing existing +objects. Generating new objects, however, remains a challenge for such methods, +which we address in this study. Specifically, we propose grounding the 3D +object insertion to a 2D object insertion in a reference view of the scene. The +2D edit is then lifted to 3D using a single-view object reconstruction method. +The reconstructed object is then inserted into the scene, guided by the priors +of monocular depth estimation methods. We evaluate our method on various 3D +scenes and provide an in-depth analysis of the proposed components. Our +experiments with generative insertion of objects in several 3D scenes indicate +the effectiveness of our method compared to the existing methods. InseRF is +capable of controllable and 3D-consistent object insertion without requiring +explicit 3D information as input. Please visit our project page at +https://mohamad-shahbazi.github.io/inserf.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation,Fahimeh Hosseini Noohdani · Parsa Hosseini · Aryan Yazdan Parast · Hamidreza Araghi · Mahdieh Baghshah, ,https://arxiv.org/abs/2402.18919,,2402.18919.pdf,Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation,"While standard Empirical Risk Minimization (ERM) training is proven effective +for image classification on in-distribution data, it fails to perform well on +out-of-distribution samples. One of the main sources of distribution shift for +image classification is the compositional nature of images. Specifically, in +addition to the main object or component(s) determining the label, some other +image components usually exist, which may lead to the shift of input +distribution between train and test environments. More importantly, these +components may have spurious correlations with the label. To address this +issue, we propose Decompose-and-Compose (DaC), which improves robustness to +correlation shift by a compositional approach based on combining elements of +images. Based on our observations, models trained with ERM usually highly +attend to either the causal components or the components having a high spurious +correlation with the label (especially in datapoints on which models have a +high confidence). In fact, according to the amount of spurious correlation and +the easiness of classification based on the causal or non-causal components, +the model usually attends to one of these more (on samples with high +confidence). Following this, we first try to identify the causal components of +images using class activation maps of models trained with ERM. 
Afterward, we +intervene on images by combining them and retraining the model on the augmented +data, including the counterfactual ones. Along with its high interpretability, +this work proposes a group-balancing method by intervening on images without +requiring group labels or information regarding the spurious features during +training. The method has an overall better worst group accuracy compared to +previous methods with the same amount of supervision on the group labels in +correlation shift.",cs.CV,"['cs.CV', 'cs.LG']"